Ambiguity is ubiquitous in human communication. Previous approaches in Human-Robot Interaction (HRI) have often relied on predefined interaction templates, leading to reduced performance in realistic and open-ended scenarios. To address these issues, we present InViG, a large-scale dataset for interactive visual grounding under language ambiguity. Our dataset comprises over 520K images accompanied by open-ended, goal-oriented disambiguation dialogues, encompassing millions of object instances and corresponding question-answer pairs. Leveraging the InViG dataset, we conduct extensive studies and propose a set of baseline solutions for end-to-end interactive visual disambiguation and grounding, achieving a 45.6% success rate on the validation set. To the best of our knowledge, the InViG dataset is the first large-scale dataset for resolving open-ended interactive visual grounding, presenting a practical yet highly challenging benchmark for ambiguity-aware HRI.
We first sample and filter 21K images from the Open Images dataset and recruit annotators to label each image with one or more targets together with human-to-human disambiguation dialogues. Based on these 21K labeled samples, we then develop an annotation system that automatically generates HRI data, producing an additional 500K goal-oriented disambiguation dialogues at extremely low cost. In total, the InViG dataset therefore contains more than 520K dialogues for interactive visual grounding. A comparison between the InViG dataset and previous works is given in Table I. In summary, the InViG dataset is proposed to address object-oriented, open-ended interactive ambiguity in HRI, which appears widely in everyday communication between humans. Differentiated from all previous works, it provides extensive interactive disambiguation data (see the data-loading sketch below) to facilitate the development of HRI systems.
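To illustrate how such a dialogue annotation could be consumed, the minimal sketch below loads one record from a JSON file. The file name and the field names (`image_id`, `bboxes`, `target`, `dialog`) are assumptions for illustration only; please consult the released dataset for the actual schema.

```python
import json

def load_dialogue(path: str) -> dict:
    """Load one (hypothetical) InViG dialogue annotation from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        record = json.load(f)
    return {
        "image_id": record["image_id"],  # source Open Images image id (assumed field)
        "bboxes": record["bboxes"],      # candidate object boxes, e.g. [x1, y1, x2, y2] (assumed field)
        "target": record["target"],      # index of the ground-truth target box (assumed field)
        "dialog": record["dialog"],      # list of (question, answer) turns (assumed field)
    }

if __name__ == "__main__":
    sample = load_dialogue("invig_sample.json")  # placeholder file name
    for question, answer in sample["dialog"]:
        print(f"Q: {question}\nA: {answer}")
```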
Some samples from InViG 21K and InViG 500K:
We show some demos of open-ended interactive grounding here, generated by our baseline solutions. The initial input to our models includes only an image and an initial referential expression. All other dialogue content and bounding boxes are generated automatically by our Guessor, Questioner, and Oracle models. Candidate bounding boxes for objects are generated using the Detic detector.
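The sketch below outlines one way the interactive loop described above could be organized: the Guessor proposes a target from the Detic candidates, and if its confidence is low, the Questioner asks a clarifying question that the Oracle answers. The function signatures, the confidence threshold, and the turn budget are assumptions for illustration, not the released API.

```python
def interactive_grounding(image, expression, questioner, oracle, guessor,
                          detector, max_turns=5, threshold=0.9):
    """Hypothetical disambiguation loop; all callables are placeholders."""
    candidates = detector(image)          # e.g., Detic box proposals
    dialogue = [expression]               # starts from the initial referential expression
    for _ in range(max_turns):
        box, confidence = guessor(image, candidates, dialogue)
        if confidence >= threshold:       # confident enough: stop asking
            return box, dialogue
        question = questioner(image, candidates, dialogue)
        answer = oracle(image, question)  # the Oracle stands in for the human partner
        dialogue += [question, answer]
    # turn budget exhausted: fall back to the current best guess
    box, _ = guessor(image, candidates, dialogue)
    return box, dialogue
```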
@inproceedings{zhangxu2023invig,
  author = {Zhang, Hanbo and Xu, Jie and Mo, Yuchen and Kong, Tao},
  title  = {InViG: Open-Ended Interactive Visual Grounding in Human-Robot Interaction with 500K Dialogues},
  year   = {2023}
}