Abstract: Extending large image-text pre-trained models (e.g., CLIP) to video understanding has seen significant progress. To enable CLIP to perceive dynamic information in ...
Abstract: Text-to-video retrieval is an essential task in multimedia information retrieval, enabling users to search and retrieve videos based on natural language descriptions. In this paper, we ...