
ViT-Based Image Captioning

A modern, from-scratch implementation of an Image Captioning system using a Vision Transformer (ViT) encoder and a Transformer Decoder. This project demonstrates the power of attention mechanisms in both computer vision and natural language processing.

🚀 Project Overview

The goal of this project was to build a Transformer-based system entirely from scratch to understand the mechanics of cross-modal attention. Unlike traditional approaches that rely on a pre-trained ResNet/CNN backbone, this model learns feature representations directly from image patches using a ViT architecture.

Key Features

  • Pure Transformer Architecture: Both encoder (ViT) and decoder are Transformer-based.
  • Beam Search (k=5): Implemented for inference to produce more linguistically coherent, higher-quality captions than greedy decoding.
  • Advanced Training: Features Automatic Mixed Precision (AMP), Label Smoothing, AdamW optimizer with Weight Decay, and intense Data Augmentation.
  • Parquet Support: Directly processes high-performance Parquet datasets from HuggingFace.
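The beam-search decoding mentioned above can be sketched generically. This is a simplified illustration of keeping the k best partial captions at each step, not the exact `scripts/test_beam.py` implementation; `step_fn` is a hypothetical stand-in for one decoder forward pass that returns candidate next tokens with their log-probabilities:

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=5, max_len=20):
    """Keep the beam_width highest log-probability partial captions at each step.

    step_fn(sequence) -> list of (next_token, log_prob) candidates.
    """
    beams = [([start_token], 0.0)]    # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:  # completed caption: set it aside
                finished.append((seq, score))
                continue
            for token, logp in step_fn(seq):
                candidates.append((seq + [token], score + logp))
        if not candidates:            # every beam has finished
            beams = []
            break
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    finished.extend(beams)            # include still-open beams at max_len
    return max(finished, key=lambda b: b[1])[0]
```

Because scores are summed log-probabilities, beam search can recover a caption whose first word is not the single most likely one, which greedy decoding would miss.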

📂 Project Structure

├── checkpoints/          # Saved model weights (.pth files)
├── models/               # Model architecture definitions
│   ├── __init__.py
│   ├── vit.py            # Vision Transformer Encoder
│   ├── decoder.py        # Transformer Decoder
│   ├── bridge.py         # Dimension alignment & normalization
│   └── model.py          # Full ImageCaptioningModel class
├── scripts/              # Utility and testing scripts
│   ├── inference.py      # Generate captions for custom images
│   ├── test_beam.py      # Evaluate metrics using Beam Search
│   ├── test_model.py     # Standard greedy evaluation
│   └── sample_test.jpg   # Sample image for testing
├── utils/                # Data processing utilities
│   ├── __init__.py
│   └── data_loader.py    # Vocabulary & FlickrDataset logic
├── train.py              # Main training entry point
└── requirements.txt      # Project dependencies
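To give a sense of what `models/vit.py` must do first, patch embedding splits the image into fixed-size patches that become the encoder's token sequence. A minimal NumPy sketch (the actual module is a full PyTorch encoder; the 224x224 RGB input size here is an assumption for illustration):

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    patches = image.reshape(H // patch_size, patch_size, W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)          # group the two grid axes together
    return patches.reshape(-1, patch_size * patch_size * C)

img = np.zeros((224, 224, 3))
seq = patchify(img)   # 14 x 14 = 196 patch tokens, each 16*16*3 = 768 values
```

Each flattened patch is then linearly projected to the model dimension and combined with positional embeddings before entering the Transformer encoder.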

📊 Current Status

What's Working

  • Full Pipeline: Data loading, training loop, validation, and inference are fully operational and optimized for 6GB+ VRAM cards.
  • Beam Search: Significant improvement in caption quality (+5 BLEU points over greedy search).
  • Environment Recognition: The model is very good at identifying background contexts (e.g., "snow," "beach," "mountain").

What's Not Good (Yet)

  • Object Specificity: Since the ViT is trained from scratch on the relatively small Flickr8k dataset, it sometimes confuses similar objects (e.g., labeling "dogs" as "people" or "cats").
  • Vocabulary Limit: Limited to common words found in the 8,000 images of the Flickr8k dataset.

🛠️ Installation & Usage

1. Requirements

Ensure you have Python 3.10+ and CUDA installed. Install dependencies:

pip install -r requirements.txt

2. Training

Run the training script to build the vocabulary and start training:

python train.py
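One of the training features listed above, label smoothing, replaces the one-hot target with a softened distribution so the model is never pushed toward full confidence. A minimal sketch, assuming the common formulation that spreads ε over the non-target classes (the exact ε and variant used in `train.py` may differ):

```python
def smooth_targets(true_idx, vocab_size, epsilon=0.1):
    """Label smoothing: put 1 - epsilon on the true token and spread
    epsilon evenly across the remaining vocab_size - 1 tokens."""
    off_value = epsilon / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[true_idx] = 1.0 - epsilon
    return dist
```

The cross-entropy loss is then computed against this distribution instead of the hard one-hot target, which tends to regularize the decoder and improve BLEU slightly.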

3. Inference

To generate a caption for your own image using the best-trained model:

python scripts/inference.py --image "path/to/your/image.jpg"

4. Evaluation

To run the full BLEU benchmark on the 1,000-image test set:

python scripts/test_beam.py
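BLEU scores a generated caption against reference captions using clipped n-gram precision. A minimal sketch of the clipped unigram precision at BLEU's core (the real benchmark combines 1- to 4-gram precisions with a brevity penalty, typically via a library such as NLTK):

```python
from collections import Counter

def clipped_precision(candidate, reference, n=1):
    """Modified (clipped) n-gram precision: each candidate n-gram only
    counts as many times as it appears in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / max(sum(cand.values()), 1)
```

Clipping is what prevents a degenerate caption like "the the the" from scoring highly just because "the" appears in the reference.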

🔮 Future Improvements

  1. Transfer Learning: Integrate pre-trained weights (ResNet-50 or ImageNet-ViT) to drastically improve object recognition.
  2. Larger Datasets: Scale up to Flickr30k or MS-COCO for a more robust vocabulary.
  3. Attention Visualization: Add Grad-CAM or heatmap visualizations to see which part of the image the decoder is looking at while generating words.
  4. Patch Size Reduction: Move from 16x16 to 8x8 patches for higher-resolution feature extraction.
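On the last point, halving the patch size quadruples the encoder's sequence length, and self-attention cost grows with the square of that. A quick back-of-the-envelope check (224x224 input resolution is an assumption for illustration):

```python
def num_patches(image_size, patch_size):
    """Number of patch tokens for a square image with non-overlapping patches."""
    return (image_size // patch_size) ** 2

for p in (16, 8):
    n = num_patches(224, p)
    # patch size, sequence length, pairwise attention entries per layer
    print(p, n, n * n)
```

So 16x16 patches give 196 tokens while 8x8 patches give 784, roughly a 16x increase in attention computation, which is why patch-size reduction usually needs a matching efficiency or hardware budget.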

🎯 Project Goal

To explore the boundaries of Transformers in Vision. By avoiding pre-trained models initially, this project serves as a deep dive into how raw pixels can be transformed into meaningful human language through the power of Self-Attention.
