continuous 3D environments, which requires an autonomous agent to follow natural
language instructions in unseen environments. Existing end-to-end learning-based VLN
methods struggle with this task because they rely mostly on raw visual observations and
lack the semantic spatio-temporal reasoning capabilities that are crucial for generalizing to
new environments. To address this, we present a hybrid transformer-recurrence model which …