Where Are You? Dataset

The Where Are You? (WAY) dataset consists of 6,134 human embodied localization dialogs across 87 unique indoor environments. The dataset is built on the Matterport3D dataset environments using the Matterport3D Simulator and was collected by crowdsourcing on Amazon Mechanical Turk.


v1.0 contains the processed data needed for the Localization from Embodied Dialog (LED) task. Additional data is needed to work on the tasks of Embodied Visual Dialog (modeling the Observer) and Cooperative Localization (modeling both agents). Please contact meerahahn@gatech.edu to get the data and starter code for these tasks.


  • train: 4,050 episodes, 58 scenes
  • valSeen: 305 episodes, 58 scenes
  • valUnseen: 579 episodes, 11 scenes
  • test: 1,200 episodes, 18 scenes
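
As a quick consistency check on the split sizes above, the sketch below adds up the episode and scene counts. It assumes that valSeen reuses the 58 train scenes (the name and the matching count suggest this, but the list does not state it outright), so only train, valUnseen, and test contribute distinct scenes.

```python
# Split sizes as listed above: (episodes, scenes).
SPLITS = {
    "train": (4050, 58),
    "valSeen": (305, 58),
    "valUnseen": (579, 11),
    "test": (1200, 18),
}

total_episodes = sum(episodes for episodes, _ in SPLITS.values())

# Assumption: valSeen draws from the same scenes as train,
# so it is skipped when counting distinct scenes.
distinct_scenes = sum(
    scenes for name, (_, scenes) in SPLITS.items() if name != "valSeen"
)

print(total_episodes)   # 6134
print(distinct_scenes)  # 87
```

Under that assumption, both totals match the 6,134 dialogs and 87 environments quoted in the overview.
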

word_embeddings.zip (13 MB)
  • Download and place into 'data/language/'
  • glove_weights_matrix.npy is extracted from a 300d GloVe file
  • w2v_weights_matrix.npy is extracted from a 300d Word2Vec file

floorplans.zip (103 MB)
  • Download and place into 'data/floorplans/'
  • Contains top-down views of each floor of the house, as well as files that associate the pixels on the top-down maps with Matterport3D panoramic nodes
  • allScans_Node2pix.json is a dictionary that maps each scan to its panorama ids. Each panorama id maps to a list whose first element holds the pixel coordinates and whose third element holds the floor of the house the panorama is on.
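
The lookup described for allScans_Node2pix.json could be sketched as follows. This is a minimal illustration, not the repository's own loader: the helper name `pano_pixel_and_floor` and the toy data are hypothetical, "first" and "third" are read as positions 0 and 2 of the list, and the content of the middle element is not described here, so it is left as a placeholder.

```python
from typing import Any, Dict, List, Tuple


def pano_pixel_and_floor(
    node2pix: Dict[str, Dict[str, List[Any]]], scan: str, pano_id: str
) -> Tuple[Any, Any]:
    """Return (pixel coordinates, floor) for one panorama of one scan."""
    entry = node2pix[scan][pano_id]
    pixel = entry[0]  # first element: pixel coordinates on the top-down map
    floor = entry[2]  # third element: floor the panorama is on
    return pixel, floor


# Hypothetical toy data in the assumed layout; the second element is
# not documented above, so None stands in for it.
toy_node2pix = {"scanA": {"pano1": [[388, 353], None, 0]}}

pixel, floor = pano_pixel_and_floor(toy_node2pix, "scanA", "pano1")
print(pixel, floor)
```

With the real file, `node2pix` would come from `json.load` on allScans_Node2pix.json, assuming it sits directly under 'data/floorplans/' after extraction.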

connectivity.zip (2 MB)
  • Download and place into 'data/connectivity/'
  • Contains the connectivity of the Matterport3D panoramic nodes
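
The file format is not described here. Assuming the archive follows the Matterport3D Simulator convention of one connectivity JSON per scan, where each entry carries an `image_id`, an `included` flag, and an `unobstructed` list of booleans indexed by node position, a navigation graph could be built roughly like this. The function name and toy data are illustrative only.

```python
from typing import Any, Dict, List, Set


def build_adjacency(nodes: List[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map over the included panoramic nodes."""
    graph: Dict[str, Set[str]] = {
        node["image_id"]: set() for node in nodes if node["included"]
    }
    for node in nodes:
        if not node["included"]:
            continue
        # unobstructed[j] is True when node j is directly reachable.
        for j, reachable in enumerate(node["unobstructed"]):
            other = nodes[j]
            if reachable and other["included"]:
                graph[node["image_id"]].add(other["image_id"])
                graph[other["image_id"]].add(node["image_id"])
    return graph


# Hypothetical three-node scan; node "c" is excluded from the graph.
toy_nodes = [
    {"image_id": "a", "included": True, "unobstructed": [False, True, True]},
    {"image_id": "b", "included": True, "unobstructed": [True, False, False]},
    {"image_id": "c", "included": False, "unobstructed": [True, False, False]},
]

graph = build_adjacency(toy_nodes)
print(graph)
```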

way_splits.zip (2 MB)
  • Contains the annotations for the train, val and test splits. In the test split, for each episode we do not provide the finalLocation, navPath or detailedNavPath. Test evaluations should be done on the evaluation and leaderboard server.
  • episodeId and socketId are unique to each annotation.
  • dialogArray lists the messages in chronological order, alternating between the Locator and the Observer and starting with the Locator.
  • navPath is an array of viewpoint ids in chronological order that the Observer visits during the episode.
  • detailedNavPath contains the paths taken by the Observer between each round in the dialog. Each array in the list represents a turn of the Observer. Each navigation move is represented by a tuple of [viewpoint, pixel location, floor].
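
Because dialogArray stores bare messages, the speaker has to be recovered from the position in the array. A minimal sketch of that alternation rule, using a hypothetical helper name `label_speakers`:

```python
from typing import List, Tuple


def label_speakers(dialog_array: List[str]) -> List[Tuple[str, str]]:
    """Pair each message with its speaker; the Locator always speaks first."""
    roles = ("Locator", "Observer")
    return [(roles[i % 2], message) for i, message in enumerate(dialog_array)]


dialog = [
    "what do you see?",
    "look for a white couch next to a rug with square patterns that are blue black and white.",
    "yup. where are you standing in that room?",
]

labeled = label_speakers(dialog)
for speaker, message in labeled:
    print(f"{speaker}: {message}")
```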

Format of {split}_data.json

      {
        "episodeId": "3041",
        "scanName": "5q7pvUzZiYa",
        "dialogArray": [
          "what do you see?",
          "look for a white couch next to a rug with square patterns that are blue black and white.",
          "yup. where are you standing in that room?",
          ...
        ],
        "finalLocation": {
          "viewPoint": "32073c62923f40c590dcac826c72e2a7",
          "floor": 0,
          "pixel_coord": [388, 353],
          "mesh_coord": [1.08773, 0.892166]
        },
        "navPath": ["efc16a390eb54273be07a53c9ac005b3", "8c29de2e66404a1faf0d953ae8bb67cf", ...],
        "detailedNavPath": [
          ["efc16a390eb54273be07a53c9ac005b3", ..., "32073c62923f40c590dcac826c72e2a7"],
          ["32073c62923f40c590dcac826c72e2a7", ..., "32073c62923f40c590dcac826c72e2a7"]
        ],
        "socketId": "xVc8PpN1yMkdtRafAATMxYipgp5ZDsWP8McmAATL"
      }

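Since the test split omits finalLocation, navPath, and detailedNavPath, code that reads episode records should tolerate their absence. The sketch below summarizes one record under that constraint. The helper name `summarize_episode` and the toy records are hypothetical, and it is an assumption that each {split}_data.json file holds a list of such records.

```python
from typing import Any, Dict, Optional


def summarize_episode(episode: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Collect basic facts about an episode, allowing for withheld test fields."""
    final = episode.get("finalLocation")  # missing or null in the test split
    nav_path = episode.get("navPath")     # missing or null in the test split
    return {
        "episodeId": episode["episodeId"],
        "scanName": episode["scanName"],
        "numMessages": len(episode["dialogArray"]),
        "targetViewpoint": final["viewPoint"] if final else None,
        "numNavSteps": len(nav_path) if nav_path is not None else None,
    }


# Hypothetical records: one annotated episode and one test-style episode.
annotated = {
    "episodeId": "3041",
    "scanName": "5q7pvUzZiYa",
    "dialogArray": ["what do you see?", "look for a white couch"],
    "finalLocation": {"viewPoint": "32073c62923f40c590dcac826c72e2a7", "floor": 0},
    "navPath": ["efc16a390eb54273be07a53c9ac005b3", "8c29de2e66404a1faf0d953ae8bb67cf"],
}
withheld = {
    "episodeId": "0000",
    "scanName": "exampleScan",
    "dialogArray": ["what do you see?"],
}

annotated_summary = summarize_episode(annotated)
withheld_summary = summarize_episode(withheld)
print(annotated_summary)
print(withheld_summary)
```

With the real data, the records would come from `json.load` on a split file such as train_data.json.
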
Trained Models to download for LED Task

lingunet-skip.pt (65.7 MB)
  • Download and place into 'data/models/lingunet-skip.pt'
  • Contains the trained lingunet-skip model described in the paper for the LED task.

crossmodal_simple.pt (72.2 MB)
  • Download and place into 'data/models/crossmodal_simple.pt'
  • Contains the trained simple crossmodal model to predict the best viewpoint on the navigation graph.

crossmodal_att.pt (67.4 MB)
  • Download and place into 'data/models/crossmodal_att.pt'
  • Contains the trained crossmodal model with attention to predict the best viewpoint on the navigation graph.