Abstract: Visual grounding aims to localize objects referred to by natural language queries. Existing fully and weakly supervised methods rely on a large number of language queries for training. However, ...