Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

De-duplicate goals for objectnav #309

Merged
merged 7 commits into from
Feb 23, 2020
Merged

De-duplicate goals for objectnav #309

merged 7 commits into from
Feb 23, 2020

Conversation

erikwijmans
Copy link
Contributor

Motivation and Context

The object nav datasets are huge, both on disk and in memory. This is largely due to the goals being duplicated across different episodes to the same object category. This PR resolves this by de-duplicating the goals.

Script to dedup current episodes: https://gist.github.com/erikwijmans/49925738798c5ced3d50e12e3d13b491

How Has This Been Tested

It loads the episodes!

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

@facebook-github-bot facebook-github-bot added the CLA Signed Do not delete this pull request or issue due to inactivity. label Feb 20, 2020
@codecov
Copy link

codecov bot commented Feb 20, 2020

Codecov Report

Merging #309 into master will decrease coverage by 0.25%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #309      +/-   ##
==========================================
- Coverage   77.34%   77.09%   -0.26%     
==========================================
  Files         108      108              
  Lines        7090     7124      +34     
==========================================
+ Hits         5484     5492       +8     
- Misses       1606     1632      +26     

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f4766f...fc1dc1a. Read the comment docs.

Copy link
Contributor

@mathfac mathfac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, we can make this changes work on previous structure of Object Nav. It will check goals_by_category, but won't fail if it's not optimized.

habitat/datasets/object_nav/object_nav_dataset.py Outdated Show resolved Hide resolved
habitat/datasets/object_nav/object_nav_dataset.py Outdated Show resolved Hide resolved
else:
raise RuntimeError("Episodes have no goals")

goals_by_category = deserialized["goals_by_category"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe:

Suggested change
goals_by_category = deserialized["goals_by_category"]
self.goals_by_category = deserialized["goals_by_category"]

@@ -30,7 +30,23 @@ class ObjectNavDatasetV1(PointNavDatasetV1):
content_scenes_path: str = "{data_path}/content/{scene}.json.gz"

def to_json(self) -> str:
self.goals_per_category = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we will have this code in from_json, then self.goals_per_category will be save here automatically.

if goals_id not in self.goals_per_category:
self.goals_per_category[goals_id] = ep.goals

self.episodes[i].goals = goals_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key can be recreated from episode data:

Suggested change
self.episodes[i].goals = goals_id
self.episodes[i].goals = []

Copy link
Contributor

@mathfac mathfac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is close to merging. Does test that reads and writes and then reads the dataset works fine with this code?

goals_by_category: Dict[str, List[ObjectGoal]]

@staticmethod
def dedup_goals(dset: Dict[str, Any]) -> Dict[str, Any]:
Copy link
Contributor

@mathfac mathfac Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're doing change in place on dset. Maybe, cleaner will be use this function as:
deserialized["goals_by_category"] = self.dedup_goals(deserialized) and move episode changing part to closer to episode.goals = self.goals_by_category[episode.goals_key].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change in-place is done on purpose. It makes a considerable difference on the time it takes to construct the list of episodes if you do this change in-place first.


:param goals_key: Key to retrieve goals with
"""
goals_key: str = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this key only once. I would use function/labda instead and avoid creating it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need it all the time. We need to save out the goals_key to disk. The de-duplicated dataset should be saved to disk for the best performance benefits (for reference, running de-dup takes 30 minutes, so it really should be done just once).

@@ -63,8 +114,18 @@ def from_json(
self.category_to_scene_annotation_category_id.keys()
), "category_to_task and category_to_mp3d must have the same keys"

for episode in deserialized["episodes"]:
episode = NavigationEpisode(**episode)
if len(deserialized["episodes"]) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to drop stop it here? I can imagine that some data that could left after deleting episodes won't be read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there aren't any episodes, there aren't any goals, so stopping here makes sense.

@mathfac
Copy link
Contributor

mathfac commented Feb 21, 2020

Looks good to go. One last piece worth to check how it will work with CONTENT_SCENES.

@erikwijmans erikwijmans merged commit f8af6d1 into master Feb 23, 2020
dhruvbatra pushed a commit that referenced this pull request May 10, 2020
* added MeshTransformNode to store mesh hierarchies and relative transforms for loaded assets in MeshMetaData

* refactor addComponent to utilize stored transforms instead of re-importing from file
@Skylion007 Skylion007 deleted the dedup-object-nav branch October 27, 2020 16:54
dannymcy pushed a commit to dannymcy/habitat-lab that referenced this pull request Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed Do not delete this pull request or issue due to inactivity.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants