This guide provides instructions on how to download and prepare the datasets required for running the continual learning experiments.
Before starting, ensure you have the following Python packages installed (beyond the base LLaVA-NeXT requirements):
```bash
pip install datasets requests tqdm pillow
```

For Hugging Face datasets, you may need to log in:

```bash
pip install huggingface_hub
huggingface-cli login  # Follow prompts to enter your token
```
- **Create a data directory:** It's recommended to store all your datasets in a single location.

  ```bash
  export DATA_DIR=/path/to/your/data/directory
  mkdir -p $DATA_DIR
  ```

- **Image Folder:** The experiment scripts use an environment variable `IMAGE_FOLDER` to locate the images for all datasets. All dataset preparation scripts place images in subdirectories within a main image folder, so a single central image folder is good practice. Set this path in the experiment scripts.

  ```bash
  export IMAGE_FOLDER=$DATA_DIR/images
  mkdir -p $IMAGE_FOLDER
  ```

  The experiment scripts in `scripts/all_experiments/final_experiments/` often include a placeholder for this path and for dataset YAMLs; update them to match your setup.
- **LLaVA-format JSONs:** The training scripts expect dataset information in a specific LLaVA JSON format. The scripts below generate these JSON files, which are then referenced by `.yaml` files in `scripts/all_experiments/`.

- **Update YAML configurations:** After generating the JSON files, update the paths in the YAML files in `scripts/all_experiments/` to point to your actual data locations.

- **Security Note:** Avoid hardcoding API keys, tokens, or private paths.
  - Use environment variables or a local config file (gitignored) for sensitive info.
  - Replace any example keys in utility scripts with your own via env vars (e.g., `FLICKR_API_KEY`, `FLICKR_API_SECRET`).
  - Update all placeholder paths to your actual data locations.
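The env-var pattern recommended above can be sketched as follows. Note that `get_flickr_credentials` is a hypothetical helper for illustration, not part of the repository:

```python
import os

def get_flickr_credentials(env=None):
    """Read Flickr API credentials from the environment.

    Hypothetical helper: it shows the env-var pattern recommended above,
    rather than the repository's actual code.
    """
    env = os.environ if env is None else env
    key = env.get("FLICKR_API_KEY")
    secret = env.get("FLICKR_API_SECRET")
    if not key or not secret:
        # Fail fast with a clear message instead of silently using a placeholder
        raise RuntimeError("Set FLICKR_API_KEY and FLICKR_API_SECRET before running.")
    return key, secret
```

Reading credentials this way keeps secrets out of version control and makes it obvious, at startup, when they are missing.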
Below are instructions for preparing each dataset. The scripts will download the data and convert it to the required format.
The CUB-200-2011 dataset is a fine-grained collection of bird images spanning 200 species.
- **Run the conversion script:** The script downloads CUB-200-2011 if missing and converts it to LLaVA format. Run it separately for each split:

  ```bash
  # Train split
  python utils/cub200_to_llava.py \
      --data_dir $DATA_DIR/datasets \
      --output_dir $DATA_DIR/llava_json \
      --image_dir $IMAGE_FOLDER/cub200 \
      --data_split train

  # Test split
  python utils/cub200_to_llava.py \
      --data_dir $DATA_DIR/datasets \
      --output_dir $DATA_DIR/llava_json \
      --image_dir $IMAGE_FOLDER/cub200 \
      --data_split test
  ```
- **Create `cub200.yaml`:** Create `scripts/all_experiments/cub200.yaml` with the following content (update the path to your generated JSON):

  ```yaml
  datasets:
    - json_path: /path/to/your/data/directory/llava_json/cub200_train.json
      sampling_strategy: "all"
  ```
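Before training, it can be worth spot-checking a generated JSON. The sketch below assumes the standard LLaVA conversation schema (`id`, `image`, `conversations` with `from`/`value` turns); `check_llava_json` is a hypothetical helper, not part of the repository:

```python
def check_llava_json(records):
    """Flag records that deviate from the common LLaVA conversation schema.

    Field names ("id", "image", "conversations", "from", "value") follow the
    usual LLaVA convention; adjust them if your conversion script differs.
    """
    problems = []
    for i, rec in enumerate(records):
        if not all(k in rec for k in ("id", "image", "conversations")):
            problems.append(f"record {i}: missing a top-level key")
            continue
        for turn in rec["conversations"]:
            # Each turn should alternate human/gpt and carry a "value" string
            if turn.get("from") not in ("human", "gpt") or "value" not in turn:
                problems.append(f"record {i}: malformed conversation turn")
    return problems
```

Usage: load the generated file with `json.load` and pass the list in, e.g. `check_llava_json(json.load(open("cub200_train.json")))`; an empty return list means no obvious schema problems.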
PixMo-Count is a dataset for object counting that requires downloading images from Flickr.
Since the images are hosted on Flickr, you'll need Flickr API credentials to download them reliably:
- Create a Flickr account at flickr.com if you don't have one.
- Apply for API access at Flickr's API page.
- Get your API key and secret from the Flickr API management page.
- **Update paths in the script:** Open `utils/download_pixmocount_and_convert_to_llava.py` and modify the paths at the bottom:

  ```python
  # In utils/download_pixmocount_and_convert_to_llava.py (bottom of file)
  if __name__ == "__main__":
      output_dir = "/path/to/your/data/directory/llava_json"
      image_dir = "/path/to/your/image/data/pixmo_count"
  ```
- **Provide Flickr API credentials (recommended):** Export them as environment variables (preferred), then modify the script to read from the environment if needed:

  ```bash
  export FLICKR_API_KEY="your_key"
  export FLICKR_API_SECRET="your_secret"
  ```

  The current script has inline placeholders; replace them with `os.environ.get("FLICKR_API_KEY")` and `os.environ.get("FLICKR_API_SECRET")`, or temporarily paste your keys locally (do not commit keys). Note: without API keys, downloads may be rate limited.
- **Run the script:**

  ```bash
  python utils/download_pixmocount_and_convert_to_llava.py
  ```

  This script will:
  - Download the PixMo-Count dataset metadata from Hugging Face (`allenai/pixmo-count`)
  - Download images from Flickr using the provided URLs
  - Convert the data to LLaVA format and save it as `pixmo_count_train.json`
The YAML file should already be present at `scripts/all_experiments/pixmocount.yaml`, but verify it points to the correct location:

```yaml
datasets:
  - json_path: /path/to/your/data/directory/llava_json/pixmo_count_train.json
    sampling_strategy: "all"
```

Troubleshooting:
- **Rate limiting errors:** Ensure you have valid Flickr API credentials.
- **Download failures:** Some images may no longer be available on Flickr. The script includes retry logic and will skip unavailable images.
- **Large download:** The dataset contains thousands of images, so the download may take several hours.
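The retry-and-skip behavior described above follows a common pattern, sketched here for reference; `fetch_with_retry` is illustrative only and not the script's actual code:

```python
import time

def fetch_with_retry(fetch, url, retries=3, backoff=2.0):
    """Illustrative retry wrapper (the download script ships its own logic).

    `fetch` is any callable that downloads `url` and raises on failure,
    e.g. a thin wrapper around requests.get. Returns None when all attempts
    fail, so the caller can skip permanently unavailable images.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                return None  # give up; caller skips this image
            time.sleep(backoff * (attempt + 1))  # linear backoff between tries
```

Returning `None` instead of raising lets a long-running download loop log and skip dead links rather than abort hours into the job.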
The PathVQA dataset contains pathology images with questions and answers.
- **Run the conversion script:**

  ```bash
  # e.g., for the train split
  python utils/pathvqa_to_llava.py \
      --output_dir $DATA_DIR/llava_json \
      --image_dir $IMAGE_FOLDER/pathvqa \
      --data_split train
  ```

  Repeat with `--data_split validation` and `--data_split test` as needed. Images are saved and JSONs are created per split.
- **Create `pathvqa.yaml`:**

  ```yaml
  datasets:
    - json_path: /path/to/your/data/directory/llava_json/pathvqa_train.json
      sampling_strategy: "all"
  ```
The TextVQA dataset requires reading text in images to answer questions.
- **Run the conversion script:**

  ```bash
  # e.g., for the train split
  python utils/textvqa_to_llava.py \
      --output_dir $DATA_DIR/llava_json \
      --image_dir $IMAGE_FOLDER/textvqa \
      --data_split train
  ```

  This downloads the dataset from Hugging Face and creates a LLaVA JSON for the selected split.
- **Create `textvqa.yaml`:**

  ```yaml
  datasets:
    - json_path: /path/to/your/data/directory/llava_json/textvqa_train.json
      sampling_strategy: "all"
  ```
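Once the dataset YAMLs are in place, a quick sanity check can confirm that every `json_path` actually exists on disk. This sketch assumes PyYAML is installed and the `datasets:` schema shown above; `missing_json_paths` is a hypothetical helper, not part of the repository:

```python
import os
import yaml  # pip install pyyaml

def missing_json_paths(yaml_path):
    """Return json_path entries from a dataset YAML that do not exist on disk.

    Assumes the `datasets: [- json_path: ...]` schema used throughout this
    guide; adjust if your YAML layout differs.
    """
    with open(yaml_path) as f:
        cfg = yaml.safe_load(f)
    return [d["json_path"] for d in cfg.get("datasets", [])
            if not os.path.exists(d["json_path"])]
```

Run it over each file in `scripts/all_experiments/`; an empty list means all referenced JSONs were found.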
For evaluation, no local conversion is required. The tasks load curated datasets directly from Hugging Face:
- `timeclock`: `AvaXiao/clockreading-time`
- `cococlock`: `Jessemel/clockreading-coco`
- `openimgclock`: `Jessemel/clockreading-openimg`
Ensure `huggingface-cli login` is configured if any dataset requires gated access.
Optional (advanced): If you want to build custom clock-reading datasets, `utils/clockreading_to_llava.py` shows an example conversion pipeline. Note that it currently contains hardcoded paths; edit `base_dir` and `output_dir` in the script before running.