
Implement public dataset downloading via TFDS #582

Merged: 10 commits merged into openvinotoolkit:develop on Dec 23, 2021
Conversation

@IRDonch commented on Dec 14, 2021

Summary

This adds a new download command, which downloads public datasets. In terms of syntax & semantics, it's comparable to the convert command, except that instead of a source directory and source format, it accepts a "dataset ID".

Currently, the downloading is done through an external library, TensorFlow Datasets. However, I expect that this might not always be the case: we might implement native downloading later, or add other download backends. Therefore, the dataset ID must begin with a namespace (currently, the only such namespace is tfds:) that signifies the download backend.

We could probably make use of the Datumaro plugin manager functionality to implement download backends, but I don't want to bother with it until there are at least two of them.

The way TFDS represents dataset items is much more flexible than the way Datumaro does it; an item can have features with arbitrary names and types. To map those features onto Datumaro's fixed annotation types, this code uses adapters (which are basically just sequences of predefined callbacks that each convert a certain feature or a metadata element into the Datumaro representation).
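As an illustration of the adapter mechanism described above, here is a minimal sketch; the class, function, and feature names below are assumptions for illustration, not the PR's actual code:

from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for the Datumaro item being built; the real code
# produces Datumaro's own DatasetItem and annotation objects.
@dataclass
class ItemBuilder:
    attributes: Dict[str, object] = field(default_factory=dict)
    annotations: List[object] = field(default_factory=list)

# A transform is a callback that maps one TFDS feature value onto the item.
FeatureTransform = Callable[[ItemBuilder, object], None]

@dataclass
class TfdsAdapterSketch:
    # feature name -> callback converting that feature into Datumaro terms
    transforms: Dict[str, FeatureTransform]

    def convert_example(self, tfds_example: Dict[str, object]) -> ItemBuilder:
        item = ItemBuilder()
        for name, value in tfds_example.items():
            transform = self.transforms.get(name)
            if transform is not None:
                transform(item, value)
        return item

# Example: an MNIST-like adapter that records the "label" feature as an attribute.
def set_label(item: ItemBuilder, value: object) -> None:
    item.attributes["label"] = value

mnist_adapter = TfdsAdapterSketch(transforms={"label": set_label})
item = mnist_adapter.convert_example({"label": 7})
assert item.attributes == {"label": 7}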

How to test

Try the download command. :-)
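As a rough sketch of such a manual test through the CLI entry point (the same one the tests import); the exact flag names of the new command are assumptions here, not taken from the diff:

# A sketch of exercising the new command programmatically; the "-i"/"-o"
# flag names and the output directory are assumptions for illustration.
from datumaro.cli.__main__ import main

exit_code = main(["download", "-i", "tfds:mnist", "-o", "mnist-dataset"])
assert exit_code == 0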

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2021 Intel Corporation
#
# SPDX-License-Identifier: MIT

@IRDonch (Author) commented on Dec 16, 2021

Tests are now implemented.

This will be used by a future `download` command.

Note that TFDS does not explicitly depend on TensorFlow, and neither does
this extra. This is because there are multiple TF distributions
(`tensorflow`, `tensorflow-gpu`, `tf-nightly`), and not having an explicit
dependency allows the user to decide which distribution to install.
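For reference, such an extra can be declared roughly like this in setup.py; the extra's name, the package metadata, and the absence of version pins below are assumptions, not the PR's actual declaration:

# A sketch of an extras_require entry that pulls in TFDS without TensorFlow.
from setuptools import setup

setup(
    name="example-package",  # placeholder metadata, not the real project
    version="0.0.0",
    extras_require={
        # Depend on TFDS only; the user separately installs whichever TF
        # distribution they prefer (tensorflow, tensorflow-gpu, tf-nightly).
        "tf": ["tensorflow-datasets"],
    },
)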
@IRDonch marked this pull request as ready for review on December 17, 2021 at 14:17
@zhiltsov-max (Contributor) commented:

I failed to download the mnist dataset on Windows from behind a proxy unless I set the proxy environment variable. The same applies to pip. Maybe this should be documented somewhere, because it is not typical for Windows users to set environment variables manually.

Yes, I see that there is now a user cache directory for TFDS.
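For anyone hitting the same proxy issue, the standard proxy variables can also be set from Python before the download runs; this is a sketch with a placeholder proxy address, and it assumes the HTTP client used by TFDS honors these variables, as most Python HTTP stacks do:

import os

# Placeholder proxy address; replace with the real proxy host and port.
os.environ.setdefault("HTTP_PROXY", "http://proxy.example.com:3128")
os.environ.setdefault("HTTPS_PROXY", "http://proxy.example.com:3128")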

Roman Donchenko added 5 commits on December 20, 2021 at 18:52
@IRDonch (Author) commented on Dec 20, 2021

> I failed to download the mnist dataset on Windows from behind a proxy unless I set the proxy environment variable. The same applies to pip. Maybe this should be documented somewhere, because it is not typical for Windows users to set environment variables manually.

I guess that's fair, but where?

(Interestingly, TFDS itself doesn't seem to document it at all.)

@zhiltsov-max (Contributor) commented:

> I guess that's fair, but where?

I'll vote for just a comment in the online command docs.

@IRDonch (Author) commented on Dec 21, 2021

> > I guess that's fair, but where?
>
> I'll vote for just a comment in the online command docs.

Okay, I added one.

Use the format of the original dataset by default.

Rename `_TfdsAdapter.metadata_transformers` to `category_transformers` to
avoid confusion with `metadata`.
@@ -263,3 +265,37 @@ def compare_dirs(test, expected: str, actual: str):
def run_datum(test, *args, expected_code=0):
    from datumaro.cli.__main__ import main
    test.assertEqual(expected_code, main(args), str(args))


@contextlib.contextmanager
def mock_tfds_data(example=None):
A Contributor commented on this code:
We should probably place such functionality (mocks, fixtures, pytest-dependent helpers, etc.) in the tests/ directory. run_datum is also a good candidate for moving. This is not a blocker, but we should do it at some point.
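One possible shape for that move, sketched below; the file name, the fixture, and the placeholder body are assumptions, not part of this PR:

# tests/conftest.py -- a hypothetical home for shared test helpers.
import contextlib

import pytest


@contextlib.contextmanager
def mock_tfds_data(example=None):
    # Placeholder body; the real helper (added in this PR) mocks TFDS so that
    # tests do not hit the network.
    yield example


@pytest.fixture
def tfds_data():
    # Expose the helper as a pytest fixture for tests that prefer fixtures.
    with mock_tfds_data() as example:
        yield example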

@zhiltsov-max merged commit a28b32d into openvinotoolkit:develop on Dec 23, 2021
@IRDonch deleted the tfds-download branch on September 9, 2022