
Inplace save & update dataset #102

Merged
merged 24 commits into from
Feb 13, 2021
Conversation

zhiltsov-max
Contributor

@zhiltsov-max zhiltsov-max commented Feb 8, 2021

Summary

  • Dataset operations are finally made lazy
  • Transforms can be performed lazily for a Dataset
  • Dataset implements caching for input source. Multiple sources are immediately merged.
  • The order of elements in a Dataset is maintained, but is not guaranteed to be the same after saving and loading
  • Added partial saving interface for datasets (for in-place dataset updates in the same format)
  • Implemented partial saving for Datumaro format
  • Extended Dataset interface with cache control, changed data info, source path and format info
  • Dataset.get() returns None instead of raising an exception when the item doesn't exist
  • Supported the `in` operator for Dataset
  • Added get operation for Extractor
  • Added type annotations for Dataset class
  • Extended API model with new interfaces
  • Converter interface is extended by optional operation to support partial data update (patch()). The default implementation uses the regular full-dataset saving.
  • Added specific error types to be used instead of generic Exception
  • Dataset can track updates and generate patches. A transform is considered to update the whole dataset
  • Dataset.get_subset provides modifiable slices
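Several of the behaviors above (Dataset.get() returning None for missing items, the `in` operator, and patch tracking where a transform marks the whole dataset as updated) can be illustrated with a minimal toy sketch. This is not the actual Datumaro API; all class and method names here are illustrative assumptions.

```python
class ToyDataset:
    """Toy model of the PR's Dataset behaviors; not the real Datumaro class."""

    def __init__(self, items):
        # items: mapping of (id, subset) -> annotation payload
        self._items = dict(items)
        self._updated = set()  # keys changed since the last save

    def get(self, id, subset='default'):
        # Returns None instead of raising when the item is missing
        return self._items.get((id, subset))

    def __contains__(self, key):
        # Enables `(id, subset) in dataset`
        return key in self._items

    def put(self, id, subset, value):
        self._items[(id, subset)] = value
        self._updated.add((id, subset))

    def transform(self, fn):
        # A transform is considered to update the whole dataset
        self._items = {k: fn(v) for k, v in self._items.items()}
        self._updated = set(self._items)

    def get_patch(self):
        # Only the changed items would be rewritten by an in-place save
        return {k: self._items[k] for k in self._updated}
```

With this sketch, an in-place save after a single `put` would only touch one item's files, while a `transform` forces a full rewrite, mirroring the default full-dataset fallback of the Converter `patch()` operation described above.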

TBD:

  • update docs
  • update CVAT
  • implement partial save in formats

How to test

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@zhiltsov-max zhiltsov-max changed the title Inplace save & update dataset [WIP] Inplace save & update dataset Feb 8, 2021
@zhiltsov-max zhiltsov-max changed the title [WIP] Inplace save & update dataset Inplace save & update dataset Feb 10, 2021
@nmanovic

@zhiltsov-max , do we have any difficulties to solve that: "The order of elements in a Dataset is maintained, but is not guaranteed to be the same after saving and loading"?

I'm not sure that it is critical, but I prefer deterministic behaviour if it is easy to achieve.

@zhiltsov-max
Contributor Author

@nmanovic, if a format represents a dataset with several subset files, it is impossible to reproduce the initial item ordering.

Example:

Dataset:
item(1, 'train')
item(2, 'val')
item(3, 'train')

.save():

train_list.txt
val_list.txt

.load()

item(1, 'train')
item(3, 'train')
item(2, 'val')
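The example above can be sketched in a few lines: when a format stores each subset in its own file, saving groups items by subset, so the interleaved original order cannot be recovered on load. The file-naming scheme and grouping logic here are illustrative assumptions, not a real format implementation.

```python
from collections import defaultdict

def save_by_subset(items):
    # items: list of (id, subset); one "file" (a list of ids) per subset
    files = defaultdict(list)
    for id, subset in items:
        files[f'{subset}_list.txt'].append(id)
    return dict(files)

def load_from_files(files):
    # Loading reads the files one after another, so subsets no longer
    # interleave the way they did in the original dataset
    return [(id, name.removesuffix('_list.txt'))
            for name, ids in sorted(files.items())
            for id in ids]

original = [('1', 'train'), ('2', 'val'), ('3', 'train')]
reloaded = load_from_files(save_by_subset(original))
# reloaded is grouped by subset and differs from the original order
```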

@nmanovic

@zhiltsov-max , should we update documentation? Are you planning to add some short tutorials for new use cases?

@zhiltsov-max
Contributor Author

@nmanovic, I'd prefer to update the documentation after the new API for operations is introduced; otherwise the changes would be hard to perceive. Small catchy examples were added earlier and still work, but now they also perform well thanks to the added transparent caching. Thorough documentation will be added with r0.2 (VCS) / r0.3 (stable API), together with the stable API introduction.
