Skip to content

Room-across-Room (RxR) is a large-scale, multilingual dataset for Vision-and-Language Navigation (VLN) in Matterport3D environments. It contains 126k navigation instructions in English, Hindi and Telugu, and 126k navigation following demonstrations. Both annotation types include dense spatiotemporal alignments between the text and the visual per…

License

Notifications You must be signed in to change notification settings

google-research-datasets/RxR

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Room-Across-Room (RxR) Dataset

Room-Across-Room (RxR) is a multilingual dataset for Vision-and-Language Navigation (VLN) for Matterport3D environments. In contrast to related datasets such as Room-to-Room (R2R), RxR is 10x larger, multilingual (English, Hindi and Telugu), with longer and more variable paths, and it includes and fine-grained visual groundings that relate each word to pixels/surfaces in the environment.

pose_trace

RxR is released as gzipped JSON Lines and numpy archives, and has four components: guide annotations, follower annotations, pose traces, and text features. The guide annotations alone are akin to R2R and sufficient to run the standard VLN setup.

Reference

The RxR dataset is described in Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding.

Bibtex:

@inproceedings{rxr,
  title={{Room-Across-Room}: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding},
  author={Alexander Ku and Peter Anderson and Roma Patel and Eugene Ie and Jason Baldridge},
  booktitle={Conference on Empirical Methods for Natural Language Processing (EMNLP)},
  year={2020}
}

Dataset Download

To download the full 161GB dataset (guide annotations, follower annotations, pose traces, text features, grounded landmarks, and model-generated instructions), install the gsutil tool and run:

gsutil -m cp -R gs://rxr-data .

Using -m gsutil downloads using multi-threading and multi-processing, using a number of threads and processors determined by the parallel_thread_count and parallel_process_count values set in the boto configuration file. You might want to experiment with these values for the fastest download.

Alternatively, see the instructions below for downloading separate components of the dataset.

Downloading Guide Annotations

Each JSON Lines entry contains a guide annotation for a path in the environment.

Data schema:

{'split': str,
 'instruction_id': int,
 'annotator_id': int,
 'language': str,
 'path_id': int,
 'scan': str,
 'path': Sequence[str],
 'heading': float,
 'instruction': str,
 'timed_instruction': Sequence[Mapping[str, Union[str, float]]],
 'edit_distance': float}

Field descriptions:

  • split: The annotation split: train, val_seen, val_unseen, test_standard.
  • instruction_id: Uniquely identifies the guide annotation.
  • annotator_id: Uniquely identifies the guide annotator.
  • language: The IETF BCP 47 language tag: en-IN, en-US, hi-IN, te-IN.
  • path_id: Uniquely identifies a path sampled from the Matterport3D environment.
  • scan: Uniquely identifies a scan in the Matterport3D environment.
  • path: A sequence of panoramic viewpoints along the path.
  • heading: The initial heading in radians. Following R2R, the heading angle is zero facing the y-axis with z-up, and increases by turning right.
  • instruction: The navigation instruction.
  • timed_instruction: A sequence of time-aligned words in the instruction. Note that a small number of words are missing the start_time and end_time fields.
    • word: The aligned utterance.
    • start_time: The start of the time span, w.r.t. the recording.
    • end_time: The end of the time span, w.r.t. the recording.
  • edit_distance Edit distance between the manually transcribed instructions and the automatic transcript generated by Google Cloud Text-to-Speech API.

Sample entry:

{'path_id': 11,
 'split': 'val_seen',
 'scan': '2n8kARJN3HM',
 'heading': 3.105381634905035,
 'path': ['d38a4c31821c48ac9082d896e628c128',
  '1d6a100cf3d34326936ef7d0a50840d9',
  '87998608d4844fcfaca266bd5aba6516',
  '5248918af65645a28a65f59d3424598a',
  'e0b2917ecb6d4e31846957451348f80a'],
 'instruction_id': 26,
 'annotator_id': 19,
 'language': 'en-IN',
 'instruction': 'Okay, now you are in a room facing towards two bathtubs, one on the right side ...',
 'timed_instruction': [{'start_time': 0.4, 'word': 'Okay,', 'end_time': 1.0}, ...],
 'edit_distance': 0.11137440758293839}

Guide annotations can be downloaded from these links:

Downloading Follower Annotations

Each JSON Lines entry contains a follower annotation for an instruction in the guide annotations.

Data schema:

{'demonstration_id': int,
 'instruction_id': int,
 'annotator_id': int,
 'path': Sequence[str],
 'metrics': {'ne': float,
  'sr': float,
  'spl': float,
  'dtw': float,
  'ndtw': float,
  'sdtw': float}}

Field descriptions:

  • demonstration_id: Uniquely identifies the follower annotation.
  • instruction_id: Uniquely identifies the guide annotation being followed.
  • annotator_id: Uniquely identifies the follower annotator.
  • path: A sequence of panoramic viewpoint along the path traversed by the follower (which may differ to the guide path).
  • metrics: Evaluation metrics for the follower path measured against the guide path:
    • ne: Navigation Error, the shortest-path distance between the final viewpoint in each path.
    • sr: Success Rate, whether ne is less than 3 meters.
    • spl Success weighted by Path Length, success weighted by traversal efficiency.
    • dtw Dynamic Time Warping (DTW) cost, the divergence between guide and follower paths.
    • ndtw Normalized DTW cost.
    • sdtw Success-weighted normalized DTW cost.

Sample entry:

{'demonstration_id': 26,
 'instruction_id': 26,
 'annotator_id': 43,
 'path': ['d38a4c31821c48ac9082d896e628c128',
  '1d6a100cf3d34326936ef7d0a50840d9',
  'd8eb4eab2d3442e1a3a7a74fc810be22',
  '87998608d4844fcfaca266bd5aba6516',
  '5248918af65645a28a65f59d3424598a',
  'e0b2917ecb6d4e31846957451348f80a'],
 'metrics': {'ne': 0,
  'sr': 1.0,
  'spl': 0.9479355350139731,
  'dtw': 1.5461700564297585,
  'ndtw': 0.9020566069282614,
  'sdtw': 0.9020566069282614}}

Follower annotations can be downloaded from these links:

Extended Dataset

The extended RxR dataset is substantially larger and comprised of many small files. Therefore, we recommend installing the gsutil tool to copy the dataset in parallel. Individual download via URL is also an option.

Downloading Pose Traces

Guide and follower annotations are paired with their corresponding pose traces: a sequence of snapshots that capture the annotator's virtual camera pose and field-of-view. The naming convention for these files is {instruction_id:06}_guide_pose_trace.npz for guide annotations and {demonstration_id:06}_follower_pose_trace.npz for follower annotations.

Data schema:

{'pano': (np.str, [n, 1]),
 'time': (np.float32, [n, 1]),
 'audio_time': (np.float32, [n, 1]),
 'extrinsic_matrix': (np.float32, [n, 16]),
 'intrinsic_matrix': (np.float32, [n, 16]),
 'image_mask': (np.bool, [k, 128, 256]),
 'text_masks': (np.bool, [k, m]),
 'feature_weights': (np.float32, [k, 36])}

Where n is the number of snapshots, k is the number of panoramic viewpoints in the associated path, and m is the number of BERT SubWord in the tokenized instructions.

Field descriptions:

  • pano: Panoramic viewpoint of the snapshot.
  • time: Timestamp of the snapshot in seconds.
  • audio_time: Timestamp corresponding to the follower's progress listening to the guide's instruction recording. This is only included in follower pose traces.
  • extrinsic_matrix: The extrinsic parameters, or pose, of the annotator's camera.
  • intrinsic_matrix: The intrinsic parameters, or projection matrix, of the annotator's camera.
  • image_mask: Mask indicating the pixels observed in the panorama. This mask is in equirectangular format, with heading angle 0 being the center of the image.
  • text_masks: Mask indicating the utterances that have been spoken or heard by the guide or follower, respectively, at this panoramic viewpoint.
  • feature_weights: An image_mask in feature space, corresponding to the typical setting in which 36 image features are generated at 12 heading and 3 elevation increments. Each value is a mean-pooled perspective projection of the image_mask for a particular heading and elevation.

Note that image_mask, text_mask, and feature_weights are provided solely for convenience, as they can be generated from the other pose trace fields and the timed_instruction.

Download command (downloads 18.6GB):

gsutil -m cp -R gs://rxr-data/pose_traces .

Individual pose traces can be downloaded via URL: * https://storage.cloud.google.com/rxr-data/pose_traces/{split}/{instruction_id:06}_guide_pose_trace.npz * https://storage.cloud.google.com/rxr-data/pose_traces/{split}/{demonstration_id:06}_follower_pose_trace.npz

Where split is one of rxr_train, rxr_val_seen, rxr_val_unseen.

Downloading BERT Text Features

Multilingual Cased BERT features for the instructions, and cross-translations thereof, are provided solely for convenience. Cross-translations (e.g., en -> hi, te) are generated via the Google Cloud Translation API and are included as well.

Data schema:

{'tokens': (np.str, [m, 1]), 'features': (np.float32, [m, 768])}

Where m is the number of BERT SubWord in the tokenized instructions.

Field descriptions:

  • tokens: Sequence of BERT SubWord tokens.
  • features: Features extracted from the last layer of the BERT encoder.

Download command (downloads 142.5GB):

gsutil -m cp -R gs://rxr-data/text_features .

Individual text features can be downloaded via URL: * https://storage.cloud.google.com/rxr-data/pose_traces/{split}/{instruction_id:06}_{language}_text_features.npz

Where split is one of: rxr_train, rxr_val_seen, rxr_val_unseen; and language is one of: en, hi, te.

Downloading Grounded Landmark Data

We include a dataset of 1.1m grounded landmarks annotated on top of RxR instructions using weak supervision from text parsers, RxR's pose traces, and a multilingual image-text encoder trained on 1.8b images.

For details check the marky-mT5 subdirectory.

Downloading Model-Generated Instructions for Data Augmentation

We include an additional dataset of 1m model-generated navigation instructions describing additional paths sampled from training environments.

For details check the marky-mT5 subdirectory.

Visualizations

To visualize examples of RxR instructions and pose traces, code is provided in the visualizations subdirectory.

Test Server and Leaderboard

To benchmark progress on the dataset, we have launched two competitions with public leaderboards. See the links for details:

Changelog

  • 2022-04-04: Updated grounded landmark data in the marky-mT5 subdirectory to include 58 additional grounded instructions.
  • 2022-03-28: Added grounded landmark data and model-generated instructions in the marky-mT5 subdirectory.
  • 2021-05-24: Updated 7 corrupted guide_pose_trace.npz files in rxr_train (with instruction ids 105606, 31170, 39104, 63599, 102712, 83073, 110502).

Contact us

If you have a technical question regarding the dataset or publication, please create an issue in this repository. This is the fastest way to reach us.

If you would like to share feedback or report concerns, please email us at rxrvln@google.com.

License

The Matterport3D dataset is governed by the Matterport3D Terms of Use. RxR annotations are released under the CC-BY license.

About

Room-across-Room (RxR) is a large-scale, multilingual dataset for Vision-and-Language Navigation (VLN) in Matterport3D environments. It contains 126k navigation instructions in English, Hindi and Telugu, and 126k navigation following demonstrations. Both annotation types include dense spatiotemporal alignments between the text and the visual per…

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages