A modern, from-scratch implementation of an Image Captioning system using a Vision Transformer (ViT) encoder and a Transformer Decoder. This project demonstrates the power of attention mechanisms in both computer vision and natural language processing.
The goal of this project was to build a Transformer-based system entirely from scratch in order to understand the mechanics of cross-modal attention. Unlike traditional approaches that rely on pre-trained ResNet/CNN backbones, this model learns feature representations directly from image patches using a ViT architecture.
- Pure Transformer Architecture: Both encoder (ViT) and decoder are Transformer-based.
- Beam Search (k=5): Implemented for inference to ensure linguistically coherent and higher-quality captions compared to greedy search.
- Advanced Training: Features Automatic Mixed Precision (AMP), Label Smoothing, AdamW optimizer with Weight Decay, and aggressive data augmentation.
- Parquet Support: Directly processes high-performance Parquet datasets from HuggingFace.
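One of the training tricks above is easy to show concretely. Label Smoothing replaces the one-hot target with a softened distribution so the model is never pushed toward full certainty on a single token. A minimal sketch (the smoothing factor 0.1 and the tiny vocabulary are illustrative, not the project's actual settings):

```python
def smooth_labels(target_idx, vocab_size, eps=0.1):
    """One-hot target softened: the true token keeps 1 - eps of the
    probability mass, the rest is spread uniformly over the vocab."""
    dist = [eps / vocab_size] * vocab_size
    dist[target_idx] += 1.0 - eps
    return dist

# Toy vocabulary of 4 tokens, true token at index 2:
dist = smooth_labels(2, 4)  # ~[0.025, 0.025, 0.925, 0.025]
```

In practice this is a one-line flag on the loss function rather than a hand-rolled distribution, but the distribution above is exactly what that flag computes.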
├── checkpoints/ # Saved model weights (.pth files)
├── models/ # Model architecture definitions
│ ├── __init__.py
│ ├── vit.py # Vision Transformer Encoder
│ ├── decoder.py # Transformer Decoder
│ ├── bridge.py # Dimension alignment & normalization
│ └── model.py # Full ImageCaptioningModel class
├── scripts/ # Utility and testing scripts
│ ├── inference.py # Generate captions for custom images
│ ├── test_beam.py # Evaluate metrics using Beam Search
│ ├── test_model.py # Standard greedy evaluation
│ └── sample_test.jpg # Sample image for testing
├── utils/ # Data processing utilities
│ ├── __init__.py
│ └── data_loader.py # Vocabulary & FlickrDataset logic
├── train.py # Main training entry point
└── requirements.txt # Project dependencies
- Full Pipeline: Data loading, training loop, validation, and inference are fully operational and optimized for 6GB+ VRAM cards.
- Beam Search: Significant improvement in caption quality (+5 BLEU points over greedy search).
- Environment Recognition: The model reliably identifies background contexts (e.g., "snow," "beach," "mountain").
- Object Specificity: Since the ViT is trained from scratch on the relatively small Flickr8k dataset, it sometimes confuses similar objects (e.g., labeling "dogs" as "people" or "cats").
- Vocabulary Limit: Limited to common words found in the 8,000 images of the Flickr8k dataset.
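The Beam Search behind the results above keeps the k highest-scoring partial captions at each step instead of committing to the single best token, which is why it recovers more coherent phrasings than greedy decoding. A minimal sketch with a toy next-token scorer (the real `test_beam.py` scores with the decoder's logits; `toy_step` here is a stand-in):

```python
import math

def beam_search(step_fn, bos, eos, k=5, max_len=20):
    """step_fn(seq) -> list of (token, prob) pairs. Keeps the k partial
    sequences with the highest total log-probability at every step."""
    beams = [([bos], 0.0)]                      # (sequence, log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in step_fn(seq):
                cand = (seq + [tok], score + math.log(p))
                (finished if tok == eos else candidates).append(cand)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    if not finished:                            # hit max_len: fall back
        finished = beams
    return max(finished, key=lambda c: c[1])[0]

# Toy scorer standing in for the decoder's softmax output.
def toy_step(seq):
    if len(seq) >= 3:
        return [(0, 1.0)]                       # force <eos> (token 0)
    return [(1, 0.6), (2, 0.4)]

best = beam_search(toy_step, bos=-1, eos=0, k=5)
```

Summing log-probabilities rather than multiplying raw probabilities avoids numeric underflow on long captions; production implementations usually also add a length penalty, omitted here for brevity.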
Ensure you have Python 3.10+ and CUDA installed. Install dependencies:
```bash
pip install -r requirements.txt
```
Run the training script to rebuild the vocabulary and start learning:
```bash
python train.py
```
To generate a caption for your own image using the best-trained model:
```bash
python scripts/inference.py --image "path/to/your/image.jpg"
```
To run the full BLEU metric benchmark on the 1,000-image test set:
```bash
python scripts/test_beam.py
```
- Transfer Learning: Integrate pre-trained weights (e.g., a ResNet-50 or an ImageNet-pre-trained ViT) to drastically improve object recognition.
- Larger Datasets: Scale up to Flickr30k or MS-COCO for a more robust vocabulary.
- Attention Visualization: Add Grad-CAM or heatmap visualizations to see which part of the image the decoder is looking at while generating words.
- Patch Size Reduction: Move from 16x16 to 8x8 patches for higher-resolution feature extraction.
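For context on the patch-size point: halving the patch side quadruples the token sequence the encoder must attend over (the 224x224 input resolution below is the common ViT default, assumed here rather than confirmed by the project):

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches a ViT splits the image into."""
    return (image_size // patch_size) ** 2

print(num_patches(224, 16))  # 196 tokens
print(num_patches(224, 8))   # 784 tokens, 4x the sequence length
```

Since self-attention cost grows quadratically with sequence length, moving from 16x16 to 8x8 patches means roughly 16x the attention compute, so the gain in feature resolution is not free.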
The aim: to explore the boundaries of Transformers in vision. By deliberately avoiding pre-trained models, this project serves as a deep dive into how raw pixels can be transformed into meaningful human language through the power of self-attention.