Implement dm.Dataset index access #1247

itrushkin · 2024-01-22T14:49:59Z

Issue No: 126681

Summary

This PR adds random index access to dm.Dataset so that it can be converted easily to a torch.Dataset. The implementation adds a list property to the DatasetItemStorage class, which costs an additional O(n) memory. The list does not store the items themselves, only references to them (subset and ID).
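To illustrate the idea, here is a minimal sketch of a storage that keeps an ordered list of keys next to the item dictionary. It is a toy model under stated assumptions, not the DatasetItemStorage code from this PR: the class name `IndexedStorage`, the `_order` attribute, and the dict-shaped items are all hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

Key = Tuple[str, str]  # (item id, subset name)


class IndexedStorage:
    """Toy stand-in for an item storage with positional access (hypothetical)."""

    def __init__(self) -> None:
        self._data: Dict[Key, dict] = {}  # the items themselves
        self._order: List[Key] = []       # extra O(n) memory: keys only, no item copies

    def put(self, item: dict) -> bool:
        """Add or replace an item; return True if the key was new."""
        key = (item["id"], item["subset"])
        is_new = key not in self._data
        if is_new:
            self._order.append(key)
        self._data[key] = item
        return is_new

    def remove(self, item_id: str, subset: str) -> bool:
        """Delete an item; return True if it was present."""
        key = (item_id, subset)
        if key not in self._data:
            return False
        del self._data[key]
        self._order.remove(key)  # linear scan, acceptable for a sketch
        return True

    def __len__(self) -> int:
        return len(self._order)

    def __iter__(self) -> Iterator[dict]:
        return (self._data[key] for key in self._order)

    def __getitem__(self, idx: int) -> dict:
        # Positional lookup: index -> key -> item
        return self._data[self._order[idx]]


storage = IndexedStorage()
for i in range(3):
    storage.put({"id": str(i), "subset": "train"})
storage.remove("1", "train")
remaining_ids = [item["id"] for item in storage]
second_item = storage[1]
```

An object that provides `__len__` and `__getitem__` in this way has the shape that a map-style torch.Dataset wrapper would need, which is the motivation given in the summary.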

Concern:
I wonder whether this one list could also be used for indexing, removal, addition, and iteration, so that no second structure is needed and memory consumption is reduced. I tried to combine the following changes:

The _traversal_order dictionary seems unnecessary as all data is already stored and managed in the data property via .put() and .remove() methods. __len__, __iter__, and other methods can rely solely on data. Optionally, an _order property could enable O(1) __len__ implementation.
Instead of overwriting data values with None to mark removal, deleting the entire dictionary entry accurately reflects the dataset state and eliminates ambiguity.

As a result, the test_can_create_patch* tests started to fail. It appears that the existing code does not actually remove items from the dataset but only marks them as removed. This suggests that either the cache initialization or the element removal operation needs adjusting for this scenario. I would appreciate help resolving this, as it could significantly improve interaction with dm.Dataset.
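The difference between the two removal strategies can be shown with plain dictionaries. This is only a sketch of one possible reading of the failures, offered as an assumption rather than a diagnosis: code that builds a patch may rely on the None markers to know which items were removed. The helper `removed_keys` is hypothetical and is not part of Datumaro.

```python
from typing import Dict, Optional, Set, Tuple

Key = Tuple[str, str]  # (item id, subset name)


def removed_keys(data: Dict[Key, Optional[dict]]) -> Set[Key]:
    """Return the keys whose value is None, i.e. items marked as removed."""
    return {key for key, item in data.items() if item is None}


# Variant A: mark removal by overwriting the value with None.
tombstoned: Dict[Key, Optional[dict]] = {
    ("0", "train"): {"id": "0"},
    ("1", "train"): {"id": "1"},
}
tombstoned[("1", "train")] = None

# Variant B: delete the dictionary entry outright, as proposed above.
deleted: Dict[Key, Optional[dict]] = {
    ("0", "train"): {"id": "0"},
    ("1", "train"): {"id": "1"},
}
del deleted[("1", "train")]

visible_after_marking = removed_keys(tombstoned)  # removal is still recorded
visible_after_delete = removed_keys(deleted)      # removal left no trace
```

With variant B the dictionary reflects the dataset state exactly, but any consumer that needs the list of removed items would have to get it from somewhere else.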

How to test

Checklist

I have added unit tests to cover my changes.
I have added integration tests to cover my changes.
I have added the description of my changes into CHANGELOG.
I have updated the documentation accordingly

License

I submit my code changes under the same MIT License that covers the project.
Feel free to contact the maintainers if that's a concern.
I have updated the license header for each file (see an example below).

# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT

Signed-off-by: Ilya Trushkin <ilya.trushkin@intel.com>

codecov · 2024-01-22T14:59:50Z

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (eb44355) 80.53% compared to head (dba46e6) 80.54%.
Report is 4 commits behind head on develop.

❗ Current head dba46e6 differs from pull request most recent head bdbe2c2. Consider uploading reports for the commit bdbe2c2 to get more accurate results

Files	Patch %	Lines
src/datumaro/components/dataset_item_storage.py	88.88%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1247      +/-   ##
===========================================
+ Coverage    80.53%   80.54%   +0.01%     
===========================================
  Files          270      270              
  Lines        30232    30260      +28     
  Branches      5898     5906       +8     
===========================================
+ Hits         24348    24374      +26     
- Misses        4506     4507       +1     
- Partials      1378     1379       +1

Flag	Coverage Δ
ubuntu-20.04_Python-3.8	`80.53% <94.11%> (+0.01%)`	⬆️
windows-2022_Python-3.8	`80.51% <94.11%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown.

vinnamkim · 2024-01-24T04:43:28Z

Hi @itrushkin,

Could we add tests for this feature that cover scenarios where the size of the dataset increases or decreases? For example:

a) Scenario with dm.Dataset.put() and dm.Dataset.remove().

Construct an arbitrary dm.Dataset
Get an item from the dataset with index
Increase the dataset size with dm.Dataset.put or decrease it with dm.Dataset.remove a couple of times.
Get an item from the dataset with index and check whether it behaves correctly
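Scenario (a) could be sketched roughly as follows. The check compares item ids obtained by index with those obtained by iteration. `ListBackedDataset` is a hypothetical stand-in so that the snippet runs on its own; a real test would construct a dm.Dataset instead and call its put and remove methods.

```python
from typing import Any, Iterator, List


def assert_index_matches_iteration(dataset: Any) -> None:
    """Check that dataset[i] yields the same ids, in the same order, as iteration."""
    expected_ids: List[str] = [item["id"] for item in dataset]
    assert len(dataset) == len(expected_ids)
    for idx, expected_id in enumerate(expected_ids):
        assert dataset[idx]["id"] == expected_id


class ListBackedDataset:
    """Hypothetical stand-in for a dataset that supports put, remove, and indexing."""

    def __init__(self) -> None:
        self._items: List[dict] = []

    def put(self, item: dict) -> None:
        # Replace in place if the id already exists, otherwise append.
        for pos, existing in enumerate(self._items):
            if existing["id"] == item["id"]:
                self._items[pos] = item
                return
        self._items.append(item)

    def remove(self, item_id: str) -> None:
        self._items = [item for item in self._items if item["id"] != item_id]

    def __len__(self) -> int:
        return len(self._items)

    def __iter__(self) -> Iterator[dict]:
        return iter(self._items)

    def __getitem__(self, idx: int) -> dict:
        return self._items[idx]


def run_put_remove_scenario() -> List[str]:
    dataset = ListBackedDataset()
    for i in range(3):
        dataset.put({"id": str(i)})
    assert_index_matches_iteration(dataset)  # index access on the initial dataset

    dataset.put({"id": "3"})                 # grow the dataset
    dataset.put({"id": "4"})
    assert_index_matches_iteration(dataset)

    dataset.remove("0")                      # shrink the dataset
    dataset.remove("3")
    assert_index_matches_iteration(dataset)
    return [item["id"] for item in dataset]
```

The same `assert_index_matches_iteration` helper could be reused after applying a transform that changes the dataset size.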

b) Scenario with transforms

Construct an arbitrary dm.Dataset
Get an item from the dataset with index
Increase or decrease the dataset size with Datumaro transforms (I guess you might use tile/untile transforms for both https://github.com/openvinotoolkit/datumaro/tree/develop/src/datumaro/plugins/tiling).
Get an item from the dataset with index and check whether it behaves correctly


vinnamkim

LGTM.

Implement dataset index access

37c02d4

Signed-off-by: Ilya Trushkin <ilya.trushkin@intel.com>

itrushkin requested review from a team as code owners January 22, 2024 14:50

itrushkin requested review from sooahleex and removed request for a team January 22, 2024 14:50

Linter fix

dba46e6

Signed-off-by: Ilya Trushkin <ilya.trushkin@intel.com>

wonjuleee requested a review from vinnamkim January 23, 2024 01:00

itrushkin added 2 commits January 28, 2024 20:06

Add tiling transform test

3d8ae86

Signed-off-by: Ilya Trushkin <ilya.trushkin@intel.com>

Compare item ids in tests

bdbe2c2

Signed-off-by: Ilya Trushkin <ilya.trushkin@intel.com>

vinnamkim approved these changes Jan 29, 2024


vinnamkim merged commit 0d8c301 into openvinotoolkit:develop Jan 29, 2024
1 of 3 checks passed

wonjuleee mentioned this pull request Feb 1, 2024

Use datum dataset for PyTorch #1212

Closed

yunchu added this to the 2.0.0 milestone Mar 28, 2024
