<div align="center">

# CoVR: Composed Video Retrieval
## Learning Composed Video Retrieval from Web Video Captions

</div>

<div align="justify">

> Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers _both_ text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR _triplets_ is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption _pairs_, while also expanding the scope of the task to include composed _video_ retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available.

</div>

## Description
This repository contains the code for the paper ["CoVR: Learning Composed Video Retrieval from Web Video Captions"](https://arxiv.org/abs/2308.TODO).

Please visit our [webpage](http://imagine.enpc.fr/~ventural/covr) for more details.

The repository is organized as follows:

```markdown
📦 covr
 ┣ 📂 configs   # hydra config files
 ┣ 📂 src       # PyTorch datamodules
 ┣ 📂 tools     # scripts and notebooks
 ┣ 📜 .gitignore
 ┣ 📜 LICENSE
 ┣ 📜 README.md
 ┣ 📜 test.py
 ┗ 📜 train.py
```

## Installation :construction_worker:

<details><summary>Create environment</summary>

```bash
conda create --name covr python=3.8
conda activate covr
```

Install the following packages inside the conda environment:

```bash
python -m pip install --upgrade pytorch_lightning hydra-core
python -m pip install lightning einops pandas opencv-python timm fairscale tabulate transformers
```

The code was tested on Python 3.8 and PyTorch 2.0.

</details>
| 64 | + |
<details><summary>Download the datasets</summary>

### WebVid-CoVR
To use the WebVid-CoVR dataset, you will have to download the WebVid videos and the WebVid-CoVR annotations.

To download the annotations, run:
```bash
bash tools/scripts/download_annotations.sh covr
```

To download the videos, install [`mpi4py`](https://mpi4py.readthedocs.io/en/latest/install.html#) and run:
```bash
python tools/scripts/download_covr.py <split>
```
| 79 | + |
### CIRR
To use the CIRR dataset, you will have to download the CIRR images and the CIRR annotations.

To download the annotations, run:
```bash
bash tools/scripts/download_annotations.sh cirr
```

To download the images (CIRR uses the NLVR2 images), follow the instructions in the [NLVR2 repository](https://github.com/lil-lab/nlvr/tree/master/nlvr2#direct-image-download). The default folder structure is the following:

```markdown
📦 covr
 ┣ 📂 datasets
 ┃ ┣ 📂 CIRR
 ┃ ┃ ┣ 📂 images
 ┃ ┃ ┃ ┣ 📂 train
 ┃ ┃ ┃ ┣ 📂 dev
 ┃ ┃ ┃ ┗ 📂 test1
```
| 99 | + |
### FashionIQ
To use the FashionIQ dataset, you will have to download the FashionIQ images and the FashionIQ annotations.

To download the annotations, run:
```bash
bash tools/scripts/download_annotations.sh fiq
```

The image URLs are listed in the [FashionIQ repository](https://github.com/hongwang600/fashion-iq-metadata/tree/master/image_url), and [this script](https://github.com/yanbeic/VAL/blob/master/download_fashion_iq.py) can be used to download them. Some missing images can also be found [here](https://github.com/XiaoxiaoGuo/fashion-iq/issues/18). All the images should be placed in the same folder (``datasets/fashion-iq/images``).

</details>
| 111 | + |

<details><summary>(Optional) Download pre-trained models</summary>

To download the checkpoints, run:
```bash
bash tools/scripts/download_pretrained_models.sh
```

</details>
| 123 | + |

## Usage :computer:
<details><summary>Computing BLIP embeddings</summary>

Before training, you will need to compute the BLIP embeddings for the videos/images. To do so, run:
```bash
python tools/embs/save_blip_embs_vids.py  # embeddings for the WebVid-CoVR videos
python tools/embs/save_blip_embs_imgs.py  # embeddings for the CIRR or FashionIQ images
```
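
These scripts embed every item once and store the result on disk, so training never re-runs the vision encoder. The precompute-then-cache pattern can be sketched as follows; the random `embed` stub and `EMB_DIM = 256` are placeholders standing in for the real BLIP model, not its actual API or dimension:

```python
import numpy as np

EMB_DIM = 256  # placeholder; the real BLIP embedding dimension differs

def embed(item_id: str) -> np.ndarray:
    # Stub embedder: deterministic random vector per item id (NOT BLIP).
    rng = np.random.default_rng(abs(hash(item_id)) % (2**32))
    return rng.standard_normal(EMB_DIM).astype(np.float32)

def cache_embeddings(item_ids, path="embs.npy"):
    # Embed each item once and save the (N, EMB_DIM) matrix to disk.
    embs = np.stack([embed(i) for i in item_ids])
    np.save(path, embs)
    return embs

embs = cache_embeddings(["vid_001", "vid_002", "vid_003"])
```

At train time, the saved file is simply loaded back with `np.load`, which is much cheaper than re-encoding every video or image each epoch.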

</details>


<details><summary>Training</summary>

The command to launch a training experiment is the following:
```bash
python train.py [OPTIONS]
```
Argument parsing is handled by the [Hydra](https://github.com/facebookresearch/hydra) library: you can override anything in the configuration by passing arguments such as ``foo=value`` or ``foo.bar=value``.
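
To illustrate the override semantics (a toy sketch only, not Hydra itself): each ``key=value`` argument addresses a node in the nested config, with dots descending into sub-configs.

```python
# Toy illustration of Hydra-style "foo=value" / "foo.bar=value" overrides
# applied to a nested dict config. Hydra itself does much more (composition,
# type checking, config groups); this only shows the dotted-key semantics.

def apply_overrides(config: dict, overrides: list) -> dict:
    for override in overrides:
        key, _, value = override.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})  # descend, creating nodes as needed
        node[leaf] = value
    return config

config = {"trainer": {"devices": "1"}, "data": "webvid-covr"}
apply_overrides(config, ["trainer.devices=4", "data=cirr"])
# config is now {"trainer": {"devices": "4"}, "data": "cirr"}
```

So ``python train.py trainer.devices=4 data=cirr`` would change only those two values and leave the rest of the composed configuration untouched.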

</details>

<details><summary>Evaluating</summary>

The command to evaluate is the following:
```bash
python test.py test=<test> [OPTIONS]
```

</details>

<details><summary>Options parameters</summary>

#### Datasets
- ``data=webvid-covr``: WebVid-CoVR dataset.
- ``data=cirr``: CIRR dataset.
- ``data=fashioniq-split``: FashionIQ dataset; replace ``split`` with ``dress``, ``shirt`` or ``toptee``.

#### Tests
- ``test=all``: Test on WebVid-CoVR, CIRR and all three FashionIQ test sets.
- ``test=webvid-covr``: Test on WebVid-CoVR.
- ``test=cirr``: Test on CIRR.
- ``test=fashioniq``: Test on all three FashionIQ test sets (``dress``, ``shirt`` and ``toptee``).

#### Checkpoints
- ``model/ckpt=blip-l-coco``: Default checkpoint, BLIP-L finetuned on COCO.
- ``model/ckpt=webvid-covr``: CoVR checkpoint finetuned on WebVid-CoVR.

#### Training
- ``trainer=gpu``: Training with CUDA; set ``devices`` to the number of GPUs to use.
- ``trainer=ddp``: Training with Distributed Data Parallel (DDP); set ``devices`` and ``num_nodes`` to the number of GPUs and nodes to use.
- ``trainer=cpu``: Training on the CPU (not recommended).

#### Logging
- ``trainer/logger=csv``: Log the results to a CSV file. Very basic functionality.
- ``trainer/logger=wandb``: Log the results to [wandb](https://wandb.ai/). This requires installing ``wandb`` and setting up a wandb account. This is what we used to log our experiments.
- ``trainer/logger=<other>``: Other loggers (not tested).

#### Machine
- ``machine=server``: Sets the default path to the dataset folder and the batch size. You can create your own machine configuration by adding a new file in ``configs/machine``.
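
A custom machine configuration is typically a small YAML file along these lines; the field names below are a hypothetical sketch, so check the existing files in ``configs/machine`` for the actual schema used by this repository:

```yaml
# configs/machine/my_server.yaml (hypothetical example)
data_dir: /path/to/datasets   # where WebVid-CoVR / CIRR / FashionIQ live
batch_size: 256               # adjust to your GPU memory
num_workers: 8                # dataloader workers per device
```

You would then select it with ``machine=my_server`` on the command line.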

#### Experiment
There are many pre-defined experiments from the paper in ``configs/experiments``. Simply add ``experiment=<experiment>`` to the command line to use them.

</details>
| 197 | + |
## Citation
If you use this dataset and/or this code in your work, please cite our [paper](http://TODO):

```bibtex
@inproceedings{ventura23covr,
  title     = {{CoVR}: Learning Composed Video Retrieval from Web Video Captions},
  author    = {Lucas Ventura and Antoine Yang and Cordelia Schmid and G{\"u}l Varol},
  booktitle = {arXiv},
  year      = {2023}
}
```
| 209 | + |
## Acknowledgements
Based on [BLIP](https://github.com/salesforce/BLIP/) and [lightning-hydra-template](https://github.com/ashleve/lightning-hydra-template/tree/main).