<div align="center">

# CoVR: Composed Video Retrieval
## Learning Composed Video Retrieval from Web Video Captions

</div>

<div align="justify">

> Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers _both_ text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR _triplets_ is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption _pairs_, while also expanding the scope of the task to include composed _video_ retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available.

</div>

## Description
This repository contains the code for the paper ["CoVR: Learning Composed Video Retrieval from Web Video Captions"](https://arxiv.org/abs/2308.TODO).

Please visit our [webpage](http://imagine.enpc.fr/~ventural/covr) for more details.

The repository is organized as follows:

```markdown
📦 covr
 ┣ 📂 configs   # hydra config files
 ┣ 📂 src       # PyTorch datamodules
 ┣ 📂 tools     # scripts and notebooks
 ┣ 📜 .gitignore
 ┣ 📜 LICENSE
 ┣ 📜 README.md
 ┣ 📜 test.py
 ┗ 📜 train.py
```

## Installation :construction_worker:

<details><summary>Create environment</summary>

```bash
conda create --name covr python=3.8
conda activate covr
```

Install the following packages inside the conda environment:

```bash
python -m pip install --upgrade pytorch_lightning hydra-core
python -m pip install lightning einops pandas opencv-python timm fairscale tabulate transformers
```

The code was tested on Python 3.8 and PyTorch 2.0.

</details>
| 64 | + |
<details><summary>Download the datasets</summary>

### WebVid-CoVR
To use the WebVid-CoVR dataset, you will have to download the WebVid videos and the WebVid-CoVR annotations.

To download the annotations, run:
```bash
bash tools/scripts/download_annotations.sh covr
```

To download the videos, install [`mpi4py`](https://mpi4py.readthedocs.io/en/latest/install.html#) and run:
```bash
python tools/scripts/download_covr.py <split>
```
| 79 | + |
### CIRR
To use the CIRR dataset, you will have to download the CIRR images and the CIRR annotations.

To download the annotations, run:
```bash
bash tools/scripts/download_annotations.sh cirr
```

To download the images (CIRR uses the NLVR2 images), follow the instructions in the [NLVR2 repository](https://github.com/lil-lab/nlvr/tree/master/nlvr2#direct-image-download). The default folder structure is the following:

```markdown
📦 covr
 ┣ 📂 datasets
 ┃ ┣ 📂 CIRR
 ┃ ┃ ┣ 📂 images
 ┃ ┃ ┃ ┣ 📂 train
 ┃ ┃ ┃ ┣ 📂 dev
 ┃ ┃ ┃ ┗ 📂 test1
```
| 99 | + |
### FashionIQ
To use the FashionIQ dataset, you will have to download the FashionIQ images and the FashionIQ annotations.

To download the annotations, run:
```bash
bash tools/scripts/download_annotations.sh fiq
```

The image URLs are listed in the [FashionIQ repository](https://github.com/hongwang600/fashion-iq-metadata/tree/master/image_url), and [this script](https://github.com/yanbeic/VAL/blob/master/download_fashion_iq.py) can be used to download them. Some missing images can also be found [here](https://github.com/XiaoxiaoGuo/fashion-iq/issues/18). All the images should be placed in the same folder (``datasets/fashion-iq/images``).

</details>
| 111 | + |

<details><summary>(Optional) Download pre-trained models</summary>

To download the checkpoints, run:
```bash
bash tools/scripts/download_pretrained_models.sh
```

</details>
| 123 | + |

## Usage :computer:
<details><summary>Computing BLIP embeddings</summary>

Before training, you will need to compute the BLIP embeddings for the videos/images. To do so, run:
```bash
python tools/embs/save_blip_embs_vids.py  # embeddings for the WebVid-CoVR videos
python tools/embs/save_blip_embs_imgs.py  # embeddings for the CIRR or FashionIQ images
```
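
These scripts embed every item once and store the result on disk, so training never re-runs the vision encoder. The precompute-then-cache pattern can be sketched as follows; the random `embed` stub and `EMB_DIM = 256` are placeholders standing in for the real BLIP model, not its actual API or dimension:

```python
import numpy as np

EMB_DIM = 256  # placeholder; the real BLIP embedding dimension differs

def embed(item_id: str) -> np.ndarray:
    # Stub embedder: deterministic random vector per item id (NOT BLIP).
    rng = np.random.default_rng(abs(hash(item_id)) % (2**32))
    return rng.standard_normal(EMB_DIM).astype(np.float32)

def cache_embeddings(item_ids, path="embs.npy"):
    # Embed each item once and save the (N, EMB_DIM) matrix to disk.
    embs = np.stack([embed(i) for i in item_ids])
    np.save(path, embs)
    return embs

embs = cache_embeddings(["vid_001", "vid_002", "vid_003"])
```

At train time, the saved file is simply loaded back with `np.load`, which is much cheaper than re-encoding every video or image each epoch.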

</details>


<details><summary>Training</summary>

The command to launch a training experiment is the following:
```bash
python train.py [OPTIONS]
```
Argument parsing is handled by the [Hydra](https://github.com/facebookresearch/hydra) library: you can override anything in the configuration by passing arguments such as ``foo=value`` or ``foo.bar=value``.
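
To illustrate the override semantics (a toy sketch only, not Hydra itself): each ``key=value`` argument addresses a node in the nested config, with dots descending into sub-configs.

```python
# Toy illustration of Hydra-style "foo=value" / "foo.bar=value" overrides
# applied to a nested dict config. Hydra itself does much more (composition,
# type checking, config groups); this only shows the dotted-key semantics.

def apply_overrides(config: dict, overrides: list) -> dict:
    for override in overrides:
        key, _, value = override.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})  # descend, creating nodes as needed
        node[leaf] = value
    return config

config = {"trainer": {"devices": "1"}, "data": "webvid-covr"}
apply_overrides(config, ["trainer.devices=4", "data=cirr"])
# config is now {"trainer": {"devices": "4"}, "data": "cirr"}
```

So ``python train.py trainer.devices=4 data=cirr`` would change only those two values and leave the rest of the composed configuration untouched.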

</details>

<details><summary>Evaluating</summary>

The command to evaluate is the following:
```bash
python test.py test=<test> [OPTIONS]
```

</details>

<details><summary>Options parameters</summary>

#### Datasets
- ``data=webvid-covr``: WebVid-CoVR dataset.
- ``data=cirr``: CIRR dataset.
- ``data=fashioniq-split``: FashionIQ dataset; replace ``split`` with ``dress``, ``shirt`` or ``toptee``.

#### Tests
- ``test=all``: Test on WebVid-CoVR, CIRR and all three FashionIQ test sets.
- ``test=webvid-covr``: Test on WebVid-CoVR.
- ``test=cirr``: Test on CIRR.
- ``test=fashioniq``: Test on all three FashionIQ test sets (``dress``, ``shirt`` and ``toptee``).

#### Checkpoints
- ``model/ckpt=blip-l-coco``: Default checkpoint, BLIP-L finetuned on COCO.
- ``model/ckpt=webvid-covr``: CoVR checkpoint finetuned on WebVid-CoVR.

#### Training
- ``trainer=gpu``: Training with CUDA; set ``devices`` to the number of GPUs to use.
- ``trainer=ddp``: Training with Distributed Data Parallel (DDP); set ``devices`` and ``num_nodes`` to the number of GPUs and nodes to use.
- ``trainer=cpu``: Training on the CPU (not recommended).

#### Logging
- ``trainer/logger=csv``: Log the results to a CSV file. Very basic functionality.
- ``trainer/logger=wandb``: Log the results to [wandb](https://wandb.ai/). This requires installing ``wandb`` and setting up a wandb account. This is what we used to log our experiments.
- ``trainer/logger=<other>``: Other loggers (not tested).

#### Machine
- ``machine=server``: Sets the default path to the dataset folder and the batch size. You can create your own machine configuration by adding a new file in ``configs/machine``.
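
A custom machine configuration is typically a small YAML file along these lines; the field names below are a hypothetical sketch, so check the existing files in ``configs/machine`` for the actual schema used by this repository:

```yaml
# configs/machine/my_server.yaml (hypothetical example)
data_dir: /path/to/datasets   # where WebVid-CoVR / CIRR / FashionIQ live
batch_size: 256               # adjust to your GPU memory
num_workers: 8                # dataloader workers per device
```

You would then select it with ``machine=my_server`` on the command line.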

#### Experiment
There are many pre-defined experiments from the paper in ``configs/experiments``. Simply add ``experiment=<experiment>`` to the command line to use them.

</details>
| 197 | + |
## Citation
If you use this dataset and/or this code in your work, please cite our [paper](http://TODO):

```bibtex
@inproceedings{ventura23covr,
  title     = {{CoVR}: Learning Composed Video Retrieval from Web Video Captions},
  author    = {Lucas Ventura and Antoine Yang and Cordelia Schmid and G{\"u}l Varol},
  booktitle = {arXiv},
  year      = {2023}
}
```
| 209 | + |
## Acknowledgements
Based on [BLIP](https://github.com/salesforce/BLIP/) and [lightning-hydra-template](https://github.com/ashleve/lightning-hydra-template/tree/main).