
Inplace save & update dataset #102

Merged
merged 24 commits into from
Feb 13, 2021
Conversation

zhiltsov-max
Contributor

@zhiltsov-max zhiltsov-max commented Feb 8, 2021

Summary

  • Dataset operations are finally made lazy
  • Transforms can be performed lazily for a Dataset
  • Dataset implements caching for input source. Multiple sources are immediately merged.
  • The order of elements in a Dataset is maintained, but is not guaranteed to be the same after saving and loading
  • Added partial saving interface for datasets (for in-place dataset updates in the same format)
  • Implemented partial saving for Datumaro format
  • Extended Dataset interface with cache control, changed data info, source path and format info
  • Dataset.get() returns None instead of raising an exception when the item doesn't exist
  • Supported the `in` operator for Dataset
  • Added get operation for Extractor
  • Added type annotations for Dataset class
  • Extended API model with new interfaces
  • Converter interface is extended by optional operation to support partial data update (patch()). The default implementation uses the regular full-dataset saving.
  • Added specific error types to be used instead of generic Exception
  • Dataset can track updates and generate patches. A transform is considered to update the whole dataset
  • Dataset.get_subset provides modifiable slices
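Several of the behaviors above (Dataset.get() returning None for missing items, the `in` operator, and patch tracking where a transform marks the whole dataset as updated) can be illustrated with a minimal toy sketch. This is not the actual Datumaro API; all class and method names here are illustrative assumptions.

```python
class ToyDataset:
    """Toy model of the PR's Dataset behaviors; not the real Datumaro class."""

    def __init__(self, items):
        # items: mapping of (id, subset) -> annotation payload
        self._items = dict(items)
        self._updated = set()  # keys changed since the last save

    def get(self, id, subset='default'):
        # Returns None instead of raising when the item is missing
        return self._items.get((id, subset))

    def __contains__(self, key):
        # Enables `(id, subset) in dataset`
        return key in self._items

    def put(self, id, subset, value):
        self._items[(id, subset)] = value
        self._updated.add((id, subset))

    def transform(self, fn):
        # A transform is considered to update the whole dataset
        self._items = {k: fn(v) for k, v in self._items.items()}
        self._updated = set(self._items)

    def get_patch(self):
        # Only the changed items would be rewritten by an in-place save
        return {k: self._items[k] for k in self._updated}
```

With this sketch, an in-place save after a single `put` would only touch one item's files, while a `transform` forces a full rewrite, mirroring the default full-dataset fallback of the Converter `patch()` operation described above.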

TBD:

  • update docs
  • update CVAT
  • implement partial save in formats

How to test

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@zhiltsov-max zhiltsov-max changed the title Inplace save & update dataset [WIP] Inplace save & update dataset Feb 8, 2021
@zhiltsov-max zhiltsov-max changed the title [WIP] Inplace save & update dataset Inplace save & update dataset Feb 10, 2021
@nmanovic

@zhiltsov-max , do we have any difficulties to solve that: "The order of elements in a Dataset is maintained, but is not guaranteed to be the same after saving and loading"?

I'm not sure that it is critical, but I prefer deterministic behaviour if it is easy to achieve.

@zhiltsov-max
Contributor Author

@nmanovic, if a format represents a dataset with several subset files, it is impossible to reproduce the initial item ordering.

Example:

Dataset:
item(1, 'train')
item(2, 'val')
item(3, 'train')

.save():

train_list.txt
val_list.txt

.load()

item(1, 'train')
item(3, 'train')
item(2, 'val')
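The example above can be sketched in a few lines: when a format stores each subset in its own file, saving groups items by subset, so the interleaved original order cannot be recovered on load. The file-naming scheme and grouping logic here are illustrative assumptions, not a real format implementation.

```python
from collections import defaultdict

def save_by_subset(items):
    # items: list of (id, subset); one "file" (a list of ids) per subset
    files = defaultdict(list)
    for id, subset in items:
        files[f'{subset}_list.txt'].append(id)
    return dict(files)

def load_from_files(files):
    # Loading reads the files one after another, so subsets no longer
    # interleave the way they did in the original dataset
    return [(id, name.removesuffix('_list.txt'))
            for name, ids in sorted(files.items())
            for id in ids]

original = [('1', 'train'), ('2', 'val'), ('3', 'train')]
reloaded = load_from_files(save_by_subset(original))
# reloaded is grouped by subset and differs from the original order
```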

@nmanovic

@zhiltsov-max , should we update documentation? Are you planning to add some short tutorials for new use cases?

@zhiltsov-max
Contributor Author

@nmanovic, I'd prefer to update the documentation after the new API for operations is introduced; otherwise the changes would be hard to perceive. Small catchy examples were added earlier and still work, but now they also perform well thanks to the added transparent caching. Thorough documentation will be added with r0.2 (VCS) / r0.3 (stable API), together with the stable API introduction.
