-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement public dataset downloading via TFDS #582
Conversation
Tests are now implemented. |
This will be used by a future `download` command. Note that TFDS does not explicitly depend on TensorFlow, and neither does this extra. This is because there are multiple TF distributions (`tensorflow`, `tensorflow-gpu`, `tf-nightly`), and not having an explicit dependency allows the user to decide which distribution to install.
I failed to download mnist dataset on windows behind proxy, unless I set the env variable. It is actual for Yes, I see new there is a user cache directory for TFDS. |
In terms of syntax & semantics, this command is comparable to the `convert` command, except that instead of a source directory and source format, it accepts a "dataset ID". Currently, the downloading is done through an external library, TensorFlow Datasets. However, I expect that this might not always be the case: we might implement native downloading later, or add other download backends. Therefore, the dataset ID must begin with a namespace (currently, the only such namespace is `tfds:`) that signifies the download backend. We could probably make use of the Datumaro plugin manager functionality to implement download backends, but I don't want to bother with it until there are at least two of them. The way TFDS represents dataset items is much more flexible than the way Datumaro does it; an item can have features with arbitrary names and types. To map those features onto Datumaro's fixed annotation types, this code uses adapters (which are basically just sequences of predefined callbacks that each convert a certain feature or a metadata element into the Datumaro representation).
I guess that's fair, but where? (Interestingly, TFDS itself doesn't seem to document it at all.) |
I'll vote for just a comment in the online command docs. |
Okay, I added one. |
Use the format of the original dataset by default. Rename `_TfdsAdapter.metadata_transformers` to `category_transformers` to avoid confusion with `metadata`.
@@ -263,3 +265,37 @@ def compare_dirs(test, expected: str, actual: str): | |||
def run_datum(test, *args, expected_code=0): | |||
from datumaro.cli.__main__ import main | |||
test.assertEqual(expected_code, main(args), str(args)) | |||
|
|||
@contextlib.contextmanager | |||
def mock_tfds_data(example=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, we need to place such functionality (mocks, fixtures, pytest-dependents etc.) in the tests/
directory. run_datum
is also a good candidate for moving. This is not a blocker, but we should do it at some point.
Summary
This adds a new
download
command, which downloads public datasets. In terms of syntax & semantics, it's comparable to theconvert
command, except that instead of a source directory and source format, it accepts a "dataset ID".Currently, the downloading is done through an external library, TensorFlow Datasets. However, I expect that this might not always be the case: we might implement native downloading later, or add other download backends. Therefore, the dataset ID must begin with a namespace (currently, the only such namespace is
tfds:
) that signifies the download backend.We could probably make use of the Datumaro plugin manager functionality to implement download backends, but I don't want to bother with it until there are at least two of them.
The way TFDS represents dataset items is much more flexible than the way Datumaro does it; an item can have features with arbitrary names and types. To map those features onto Datumaro's fixed annotation types, this code uses adapters (which are basically just sequences of predefined callbacks that each convert a certain feature or a metadata element into the Datumaro representation).
How to test
Try the
download
command. :-)Checklist
develop
branchLicense
Feel free to contact the maintainers if that's a concern.