We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Our smallest variant outperforms ...
Between 1-8 GPUs with 27-80 GB, depending on the desired training setup (with default bfloat16 data type). See this FAQ on our project website for details. First, set up a conda environment (see ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results