De-duplicate goals for objectnav #309

erikwijmans · 2020-02-20T23:34:29Z

Motivation and Context

The object nav datasets are huge, both on disk and in memory. This is largely because the goals are duplicated across every episode that targets the same object category. This PR resolves this by de-duplicating the goals.

Script to dedup current episodes: https://gist.github.com/erikwijmans/49925738798c5ced3d50e12e3d13b491

How Has This Been Tested

It loads the episodes!

Types of changes

New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

codecov · 2020-02-20T23:49:30Z

Codecov Report

Merging #309 into master will decrease coverage by 0.25%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #309      +/-   ##
==========================================
- Coverage   77.34%   77.09%   -0.26%     
==========================================
  Files         108      108              
  Lines        7090     7124      +34     
==========================================
+ Hits         5484     5492       +8     
- Misses       1606     1632      +26

Last update 8f4766f...fc1dc1a.

mathfac

Maybe we can make these changes work with the previous Object Nav structure. The loader would check for goals_by_category, but wouldn't fail if the dataset hasn't been de-duplicated.
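A minimal sketch of that fallback, assuming the serialized dataset is a plain dict with an `episodes` list and an optional top-level `goals_by_category` mapping. The helper name `load_goals_by_category` and the key format are hypothetical, chosen only for illustration; this is not the code in the PR.

```python
from typing import Any, Dict, List


def load_goals_by_category(deserialized: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Return the goal table, building it on the fly for old-style files.

    Assumption: old-style episodes carry their goals inline under "goals",
    while de-duplicated files carry a top-level "goals_by_category" mapping.
    """
    if "goals_by_category" in deserialized:
        # Already de-duplicated: use the stored table as-is.
        return deserialized["goals_by_category"]

    # Old structure: collect the inline goals instead of failing.
    goals_by_category: Dict[str, List[Any]] = {}
    for episode in deserialized.get("episodes", []):
        # Hypothetical key; the real key format may differ.
        key = f"{episode['scene_id']}_{episode['object_category']}"
        goals_by_category.setdefault(key, episode.get("goals", []))
    return goals_by_category
```

With this shape, an old file still loads (at the old memory cost until it is re-saved), and a de-duplicated file skips the rebuild entirely.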

habitat/datasets/object_nav/object_nav_dataset.py

mathfac · 2020-02-21T02:14:38Z

habitat/datasets/object_nav/object_nav_dataset.py

+            else:
+                raise RuntimeError("Episodes have no goals")
+
+        goals_by_category = deserialized["goals_by_category"]


Maybe:

Suggested change

-        goals_by_category = deserialized["goals_by_category"]
+        self.goals_by_category = deserialized["goals_by_category"]

mathfac · 2020-02-21T02:15:16Z

habitat/datasets/object_nav/object_nav_dataset.py

@@ -30,7 +30,23 @@ class ObjectNavDatasetV1(PointNavDatasetV1):
    content_scenes_path: str = "{data_path}/content/{scene}.json.gz"

    def to_json(self) -> str:
+        self.goals_per_category = {}


If we have this code in from_json, then self.goals_per_category will be saved here automatically.

mathfac · 2020-02-21T02:16:12Z

habitat/datasets/object_nav/object_nav_dataset.py

+            if goals_id not in self.goals_per_category:
+                self.goals_per_category[goals_id] = ep.goals
+
+            self.episodes[i].goals = goals_id


Key can be recreated from episode data:

Suggested change

-            self.episodes[i].goals = goals_id
+            self.episodes[i].goals = []

mathfac

This code is close to merging. Does a test that reads the dataset, writes it, and then reads it again work with this code?
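One way to phrase such a round-trip check, as a sketch only: the `to_json` and `from_json` functions below are hypothetical stand-ins that operate on plain dicts, not the dataset class's own methods, and the sample field names are assumptions.

```python
import json
from typing import Any, Dict


def to_json(dataset: Dict[str, Any]) -> str:
    # Hypothetical stand-in for the dataset's serializer.
    return json.dumps(dataset, sort_keys=True)


def from_json(text: str) -> Dict[str, Any]:
    # Hypothetical stand-in for the dataset's deserializer.
    return json.loads(text)


def check_round_trip(serialized: str) -> bool:
    """Read, write, and read again; both reads must agree."""
    first = from_json(serialized)
    second = from_json(to_json(first))
    return first == second
```

Against the real dataset class, the same pattern would compare the episodes and the goal table from the first and second load.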

habitat/tasks/nav/object_nav_task.py

mathfac · 2020-02-21T18:06:18Z

habitat/datasets/object_nav/object_nav_dataset.py

+    goals_by_category: Dict[str, List[ObjectGoal]]
+
+    @staticmethod
+    def dedup_goals(dset: Dict[str, Any]) -> Dict[str, Any]:


You're changing dset in place. Maybe it would be cleaner to use this function as:
deserialized["goals_by_category"] = self.dedup_goals(deserialized) and move the part that changes the episodes closer to episode.goals = self.goals_by_category[episode.goals_key].

The change in-place is done on purpose. It makes a considerable difference on the time it takes to construct the list of episodes if you do this change in-place first.
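The in-place approach can be sketched roughly as follows, assuming the raw dataset is a dict of plain JSON values and that goals can be keyed from episode fields. The function name `dedup_goals_in_place` and the key format are hypothetical, and the sketch makes no claim about the actual timings.

```python
from typing import Any, Dict


def dedup_goals_in_place(dset: Dict[str, Any]) -> Dict[str, Any]:
    """Move inline goals into a shared table, mutating ``dset``."""
    if not dset["episodes"]:
        return dset  # Nothing to de-duplicate.

    goals_by_category = dset.setdefault("goals_by_category", {})
    for episode in dset["episodes"]:
        # Hypothetical key built from episode data.
        key = f"{episode['scene_id']}_{episode['object_category']}"
        # pop() removes the inline copy, so later code never has to
        # build goal objects for each episode separately.
        goals = episode.pop("goals", [])
        goals_by_category.setdefault(key, goals)
        episode["goals_key"] = key
    return dset
```

Because the raw episode dicts no longer carry goals after this pass, constructing the episode list only touches the small per-episode fields.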

habitat/datasets/object_nav/object_nav_dataset.py

mathfac · 2020-02-21T18:16:17Z

habitat/tasks/nav/object_nav_task.py

+
+    :param goals_key: Key to retrieve goals with
+    """
+    goals_key: str = None


We need this key only once. I would use a function/lambda instead and avoid creating it.

You need it all the time. We need to save out the goals_key to disk. The de-duplicated dataset should be saved to disk for the best performance benefits (for reference, running de-dup takes 30 minutes, so it really should be done just once).
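To illustrate why a stored key helps at load time, here is a sketch under assumptions: `Episode` is a hypothetical stand-in for the real episode class, and `attach_goals` is an illustrative helper, not part of the PR. Each episode read from disk carries only its `goals_key`, and the goals are looked up in the shared table.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Episode:
    # Hypothetical stand-in for the real episode class.
    episode_id: str
    goals_key: Optional[str] = None
    goals: List[Any] = field(default_factory=list)


def attach_goals(
    raw_episodes: List[Dict[str, Any]],
    goals_by_category: Dict[str, List[Any]],
) -> List[Episode]:
    episodes = []
    for raw in raw_episodes:
        episode = Episode(**raw)
        # Every episode with the same key points at the same list object,
        # so the goals are held in memory once per key.
        episode.goals = goals_by_category[episode.goals_key]
        episodes.append(episode)
    return episodes
```

Since the key is already on disk, loading needs no de-duplication pass, only a dictionary lookup per episode.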

mathfac · 2020-02-21T22:29:14Z

habitat/datasets/object_nav/object_nav_dataset.py

@@ -63,8 +114,18 @@ def from_json(
            self.category_to_scene_annotation_category_id.keys()
        ), "category_to_task and category_to_mp3d must have the same keys"

-        for episode in deserialized["episodes"]:
-            episode = NavigationEpisode(**episode)
+        if len(deserialized["episodes"]) == 0:


Do we really need to stop here? I can imagine that some data left over after deleting episodes won't be read.

If there aren't any episodes, there aren't any goals, so stopping here makes sense.

mathfac · 2020-02-21T22:30:54Z

Looks good to go. One last thing worth checking is how it will work with CONTENT_SCENES.

* added MeshTransformNode to store mesh hierarchies and relative transforms for loaded assets in MeshMetaData * refactor addComponent to utilize stored transforms instead of re-importing from file

De-duplicate goals for objectnav

06ca123

erikwijmans requested a review from mathfac February 20, 2020 23:34

facebook-github-bot added the CLA Signed label Feb 20, 2020

mathfac reviewed Feb 20, 2020

erikwijmans added 3 commits February 20, 2020 19:07

Comments

ea1d6f5

Handle train.json.gz file correctly

50fe0c6

Raise instead of assert

1446f2a