Implement dm.Dataset index access #1247

Merged
merged 4 commits into from
Jan 29, 2024

Conversation

itrushkin
Contributor

@itrushkin itrushkin commented Jan 22, 2024

Issue No: 126681

Summary

This PR adds random index access to dm.Dataset so that it can be converted easily to a torch.Dataset. The implementation adds a list property to the DatasetItemStorage class, which costs an additional O(n) memory. The list does not store the items themselves, only references to them (subset and ID).
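To illustrate the idea, here is a minimal sketch of a storage that keeps a list of (ID, subset) references next to the item dictionary. It is not the PR's actual code: the class name IndexedItemStorage, the attribute _index_keys, and the method signatures are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (item id, subset name)


class IndexedItemStorage:
    """Toy stand-in for an item storage with positional access (hypothetical)."""

    def __init__(self) -> None:
        self.data: Dict[Key, object] = {}  # owns the items
        self._index_keys: List[Key] = []   # extra O(n) list of references only

    def put(self, item_id: str, subset: str, item: object) -> bool:
        key = (item_id, subset)
        is_new = key not in self.data
        if is_new:
            self._index_keys.append(key)
        self.data[key] = item
        return is_new

    def remove(self, item_id: str, subset: str) -> bool:
        key = (item_id, subset)
        if key not in self.data:
            return False
        del self.data[key]
        self._index_keys.remove(key)  # linear scan; acceptable for a sketch
        return True

    def __len__(self) -> int:
        return len(self._index_keys)

    def __getitem__(self, idx: int) -> object:
        # Resolve the position to a key, then the key to the stored item.
        return self.data[self._index_keys[idx]]


storage = IndexedItemStorage()
storage.put("a", "train", {"label": 0})
storage.put("b", "val", {"label": 1})
first = storage[0]
last = storage[-1]
```

Because __len__ and integer __getitem__ are all a map-style dataset needs, an object shaped like this can be wrapped by a torch.Dataset subclass with very little glue code.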

Concern:
I wonder whether this single list could be stored and managed to serve indexing, removal, addition, and iteration together, which would reduce memory consumption. I tried combining the following changes:

  1. The _traversal_order dictionary seems unnecessary, as all data is already stored and managed in the data property via the .put() and .remove() methods. __len__, __iter__, and other methods can rely solely on data. Optionally, an _order property could enable an O(1) __len__ implementation.
  2. Instead of overwriting data values with None to mark removal, deleting the entire dictionary entry accurately reflects the dataset state and eliminates ambiguity.

As a result, the test_can_create_patch* tests started to fail. It appears that items were not actually removed from the dataset, but merely marked as removed. This suggests that either the cache initialization or the element removal operation needs adjusting for this scenario. Help with resolving this could significantly improve interaction with dm.Dataset.
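The difference between marking and deleting can be sketched as follows. This is only one possible reading of the failure, under the assumption that some downstream code needs to know which keys were removed; the classes TombstoneStorage and DeletingStorage and the helper removed_keys are hypothetical and do not come from Datumaro.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (item id, subset name)


class TombstoneStorage:
    """Hypothetical storage that marks removed items with None."""

    def __init__(self) -> None:
        self.data: Dict[Key, Optional[object]] = {}

    def put(self, key: Key, item: object) -> None:
        self.data[key] = item

    def remove(self, key: Key) -> None:
        self.data[key] = None  # the key survives, the value does not

    def __len__(self) -> int:
        # Marked entries must be skipped, so this is O(n).
        return sum(1 for value in self.data.values() if value is not None)

    def removed_keys(self) -> List[Key]:
        return [key for key, value in self.data.items() if value is None]


class DeletingStorage(TombstoneStorage):
    """Variant that deletes the dictionary entry outright."""

    def remove(self, key: Key) -> None:
        self.data.pop(key, None)


marked = TombstoneStorage()
deleted = DeletingStorage()
for store in (marked, deleted):
    store.put(("a", "train"), "item-a")
    store.put(("b", "train"), "item-b")
    store.remove(("a", "train"))

marked_removed = marked.removed_keys()
deleted_removed = deleted.removed_keys()
```

Both variants report the same length, but only the marking variant can still say which keys were removed. If patch creation depends on that record (an assumption here), deleting entries would need a separate record of removals.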

How to test

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have added the description of my changes into CHANGELOG.
  • I have updated the documentation accordingly

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT

Signed-off-by: Ilya Trushkin <ilya.trushkin@intel.com>
@itrushkin itrushkin requested review from a team as code owners January 22, 2024 14:50
@itrushkin itrushkin requested review from sooahleex and removed request for a team January 22, 2024 14:50

codecov bot commented Jan 22, 2024

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (eb44355) 80.53% compared to head (dba46e6) 80.54%.
Report is 4 commits behind head on develop.

❗ Current head dba46e6 differs from pull request most recent head bdbe2c2. Consider uploading reports for the commit bdbe2c2 to get more accurate results

Files                                             Patch %   Lines
src/datumaro/components/dataset_item_storage.py   88.88%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1247      +/-   ##
===========================================
+ Coverage    80.53%   80.54%   +0.01%     
===========================================
  Files          270      270              
  Lines        30232    30260      +28     
  Branches      5898     5906       +8     
===========================================
+ Hits         24348    24374      +26     
- Misses        4506     4507       +1     
- Partials      1378     1379       +1     
Flag                      Coverage Δ
ubuntu-20.04_Python-3.8   80.53% <94.11%> (+0.01%) ⬆️
windows-2022_Python-3.8   80.51% <94.11%> (+0.01%) ⬆️


Signed-off-by: Ilya Trushkin <ilya.trushkin@intel.com>
@vinnamkim
Contributor

vinnamkim commented Jan 24, 2024

Hi @itrushkin,

Could we add tests for this feature that cover scenarios where the size of the dataset increases or decreases? For example,

a) Scenario with dm.Dataset.put() and dm.Dataset.remove().

  1. Construct an arbitrary dm.Dataset.
  2. Get an item from the dataset by index.
  3. Increase the dataset size with dm.Dataset.put() or decrease it with dm.Dataset.remove() a couple of times.
  4. Get an item from the dataset by index and check whether it behaves correctly.
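Scenario a) could be sketched roughly as below. To keep the example self-contained it runs against a toy stand-in rather than the real dm.Dataset: the ToyDataset class, the dict-shaped items, and the expectation that an out-of-range index raises IndexError are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (item id, subset name)


class ToyDataset:
    """Stand-in for dm.Dataset (hypothetical); models put, remove, len and []."""

    def __init__(self, items: Iterable[dict]) -> None:
        self._items: Dict[Key, dict] = {}
        for item in items:
            self.put(item)

    def put(self, item: dict) -> None:
        self._items[(item["id"], item["subset"])] = item

    def remove(self, item_id: str, subset: str) -> None:
        self._items.pop((item_id, subset), None)

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, idx: int) -> dict:
        keys = list(self._items)  # dicts preserve insertion order
        return self._items[keys[idx]]


def check_index_access(dataset: ToyDataset) -> None:
    """Each valid index yields a distinct item; one past the end raises."""
    seen = [dataset[i]["id"] for i in range(len(dataset))]
    assert len(seen) == len(set(seen)) == len(dataset)
    try:
        dataset[len(dataset)]
    except IndexError:
        pass
    else:
        raise AssertionError("expected IndexError past the end")


def run_scenario() -> List[str]:
    # Step 1: construct an arbitrary dataset.
    dataset = ToyDataset({"id": str(i), "subset": "train"} for i in range(3))
    # Step 2: index access on the initial dataset.
    check_index_access(dataset)
    # Step 3: grow, then shrink a couple of times.
    dataset.put({"id": "3", "subset": "train"})
    check_index_access(dataset)
    dataset.remove("0", "train")
    dataset.remove("2", "train")
    # Step 4: index access must still be consistent with the new size.
    check_index_access(dataset)
    return [dataset[i]["id"] for i in range(len(dataset))]


ids_after = run_scenario()
```

A real test would build the dataset with Datumaro's own constructors and items, and would apply the same checks after each size change.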

b) Scenario with transforms

  1. Construct an arbitrary dm.Dataset.
  2. Get an item from the dataset by index.
  3. Increase or decrease the dataset size with Datumaro transforms (I guess you could use the tile/untile transforms for both: https://github.com/openvinotoolkit/datumaro/tree/develop/src/datumaro/plugins/tiling).
  4. Get an item from the dataset by index and check whether it behaves correctly.

Signed-off-by: Ilya Trushkin <ilya.trushkin@intel.com>
Signed-off-by: Ilya Trushkin <ilya.trushkin@intel.com>
Contributor

@vinnamkim vinnamkim left a comment

LGTM.

@vinnamkim vinnamkim merged commit 0d8c301 into openvinotoolkit:develop Jan 29, 2024
1 of 3 checks passed
@yunchu yunchu added this to the 2.0.0 milestone Mar 28, 2024